Preface 



The papers contained in this volume were presented at ICDT 2001, the 8th 
International Conference on Database Theory, held from 4 to 6 January 2001 in 
the Senate House of Birkbeck College, University of London, UK. 

The series of ICDT conferences provides a biennial, international forum for 
the communication of research advances on the principles of database systems. 
ICDT is traditionally held in beautiful European locations: Rome in 1986, Bruges 
in 1988, Paris in 1990, Berlin in 1992, Prague in 1995, Delphi in 1997, and 
Jerusalem in 1999. Since 1992, ICDT has merged with the Symposium on 
Mathematical Fundamentals of Database Systems (MFDBS), initiated in Dresden 
in 1987, and continued in Visegrad in 1989 and Rostock in 1991. 

This volume contains 26 papers describing original research on fundamental 
aspects of database systems, selected from 75 submissions. The paper by Chung 
Keung Poon received the award for Best Newcomer to the field of database theory. In 
addition, this volume contains two invited papers by Leonid Libkin and Philip 
Wadler. A third invited talk was given by Andrei Broder. 

We wish to thank all the authors who submitted papers, the members of the 
program committee, the external referees, the organizing committee, and the 
sponsors for their efforts and support. 
Expressive Power of SQL 



Leonid Libkin 

University of Toronto and Bell Laboratories 
Email: libkin@cs.toronto.edu 



Abstract. It is a folk result in database theory that SQL cannot ex- 
press recursive queries such as reachability; in fact, a new construct was 
added to SQL3 to overcome this limitation. However, the evidence for 
this claim is usually given in the form of a reference to a proof that re- 
lational algebra cannot express such queries. SQL, on the other hand, in 
all its implementations has three features that fundamentally distinguish 
it from relational algebra: namely, grouping, arithmetic operations, and 
aggregation. 

In the past few years, most questions about the additional power pro- 
vided by these features have been answered. This paper surveys those 
results, and presents new simple and self-contained proofs of the main 
results on the expressive power of SQL. Somewhat surprisingly, tiny dif- 
ferences in the language definition affect the results in a dramatic way: 
under some very natural assumptions, it can be proved that SQL cannot 
define recursive queries, no matter what aggregate functions and arith- 
metic operations are allowed. But relaxing these assumptions just a tiny 
bit makes the problem of proving expressivity bounds for SQL as hard 
as some long-standing open problems in complexity theory. 



1 Introduction 

What queries can one express in SQL? Perhaps more importantly, one would like 
to know what queries cannot be expressed in SQL - after all, it is the inability to 
express certain properties that motivates language designers to add new features 
(at least one hopes that this is the case). 

This seems to be a rather basic question that database theoreticians should 
have produced an answer to by the beginning of the 3rd millennium. After all, 
we’ve been studying the expressive power of query languages for some 20 years 
now (and in fact more than that, if you count earlier papers by logicians on 
the expressiveness of first-order logic), and SQL is the de-facto standard of the 
commercial database world - so there surely must be an answer somewhere in 
the literature. 

When one thinks of the limitations of SQL, its inability to express reachability 
queries comes to mind, as it is well documented in the literature (in fact, in 
many database books written for very different audiences, e.g. [1,5,7,25]). Let us 
consider a simple example: suppose that R(Src,Dest) is a relation with flight 
information: Src stands for source, and Dest for destination. To find pairs of 
cities (A, B) such that it is possible to fly from A to B with one stop, one would 
use a self-join: 

    SELECT R1.Src, R2.Dest
    FROM R AS R1, R AS R2
    WHERE R1.Dest = R2.Src

What if we want pairs of cities such that one makes two stops on the way? Then 
we do a more complicated self-join: 

    SELECT R1.Src, R3.Dest
    FROM R AS R1, R AS R2, R AS R3
    WHERE R1.Dest = R2.Src AND R2.Dest = R3.Src

Taking the union of these two and the relation R itself we would get the pairs of 
cities such that one can fly from A to B with at most two stops. But often one 
needs a general reachability query in which no a priori bound on the number of 
stops is known; that is, whether it is possible to get to B from A. 
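
As a rough illustration of the bounded case, the following Python sketch runs the union of the three queries above in an in-memory SQLite database. The flight data and the names `conn`, `AT_MOST_TWO_STOPS` and `pairs` are assumptions made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (Src TEXT, Dest TEXT)")
# Hypothetical flight data: a single chain A -> B -> C -> D -> E.
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")],
)

# R itself, the one-stop self-join, and the two-stop self-join.
AT_MOST_TWO_STOPS = """
SELECT Src, Dest FROM R
UNION
SELECT R1.Src, R2.Dest FROM R AS R1, R AS R2
WHERE R1.Dest = R2.Src
UNION
SELECT R1.Src, R3.Dest FROM R AS R1, R AS R2, R AS R3
WHERE R1.Dest = R2.Src AND R2.Dest = R3.Src
"""

pairs = set(conn.execute(AT_MOST_TWO_STOPS))
print(sorted(pairs))
```

On this data the pair (A, E) is missing from the answer, because reaching E from A takes three stops; every fixed union of self-joins has such a blind spot on a long enough chain.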

Graph-theoretically, this means computing the transitive closure of R. It is 
well known that the transitive closure of a graph is not expressible in relational 
algebra or calculus; in particular, expressions similar to those above (which hap- 
pen to be unions of conjunctive queries) cannot possibly express it. This appears 
to be a folk result in the database community; while many papers do refer to [2] 
or some other source on the expressive power of first-order logic, many texts just 
state that relational algebra, calculus and SQL cannot express recursive queries 
such as reachability. 

With this limitation in mind, the SQL3 standard introduced recursion expli- 
citly into the language [7,12]. One would write the reachability query as 

    WITH RECURSIVE TrCl(Src, Dest) AS
        R
        UNION
        SELECT TrCl.Src, R.Dest
        FROM TrCl, R
        WHERE TrCl.Dest = R.Src
    SELECT * FROM TrCl

This simply models the usual datalog rules for transitive closure: 

    trcl(x,y) :- r(x,y)
    trcl(x,y) :- trcl(x,z), r(z,y)
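
For readers who want to try the recursive query, here is a minimal sketch in Python using SQLite, which accepts a `WITH RECURSIVE` clause. The concrete syntax is adapted to SQLite (the base case is written as a `SELECT`, and the recursive part is parenthesized), and the sample relation is an assumption of this example rather than anything taken from the SQL3 standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (Src TEXT, Dest TEXT)")
# Hypothetical flight data: a single chain A -> B -> C -> D -> E.
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")],
)

# Transitive closure, following the two datalog rules for trcl.
REACHABILITY = """
WITH RECURSIVE TrCl(Src, Dest) AS (
    SELECT Src, Dest FROM R
    UNION
    SELECT TrCl.Src, R.Dest
    FROM TrCl, R
    WHERE TrCl.Dest = R.Src
)
SELECT Src, Dest FROM TrCl
"""

closure = set(conn.execute(REACHABILITY))
print(sorted(closure))
```

Using `UNION` rather than `UNION ALL` gives set semantics, so the recursion also stops on cyclic inputs.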

When a new construct is added to a language, a good reason must exist 
for it, especially if the language is a declarative query language, with a small 
number of constructs, and with programmers relying heavily on its optimizer. 
The reason for introducing recursion in the next SQL standard is precisely this 
folk result stating that it cannot be expressed in the language. But when one 
looks at what evidence is provided to support this claim, one notices that all 
the references point to papers in which it is proved that relational algebra and 
calculus cannot express recursive queries. Why is this not sufficient? Consider 
the following query 

    SELECT R1.A
    FROM R1, R2
    WHERE (SELECT COUNT(*) FROM R1) >
          (SELECT COUNT(*) FROM R2)

This query tests if |R1| > |R2|: in that case, it returns the A attribute of R1, 
otherwise it returns the empty set. However, logicians proved long ago 
that first-order logic, and thus relational calculus, cannot compare cardinalities 
of relations, and yet we have a very simple SQL query doing precisely that. 
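
The cardinality test can be checked directly. The sketch below, again in Python with SQLite, assumes two single-column tables with made-up contents; `DISTINCT` is added only to mimic set semantics, and R2 is kept non-empty so that the product in the `FROM` clause is not trivially empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R1 (A TEXT)")
conn.execute("CREATE TABLE R2 (A TEXT)")
# Hypothetical contents: |R1| = 3 and |R2| = 2.
conn.executemany("INSERT INTO R1 VALUES (?)", [("x",), ("y",), ("z",)])
conn.executemany("INSERT INTO R2 VALUES (?)", [("u",), ("v",)])

QUERY = """
SELECT DISTINCT R1.A
FROM R1, R2
WHERE (SELECT COUNT(*) FROM R1) > (SELECT COUNT(*) FROM R2)
"""

def compare():
    """Return the A values of R1 if |R1| > |R2|, and an empty list otherwise."""
    return sorted(row[0] for row in conn.execute(QUERY))

result_before = compare()          # |R1| = 3 > |R2| = 2
conn.executemany("INSERT INTO R2 VALUES (?)", [("w",), ("t",)])
result_after = compare()           # |R1| = 3 < |R2| = 4
print(result_before, result_after)
```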

The conclusion, of course, is that SQL has more power than relational alge- 
bra, and the main source of this additional power is its aggregation and grouping 
constructs, together with arithmetic operations on numerical attributes. But 
then one cannot say that the transitive closure query is not expressible in SQL 
simply because it is inexpressible in relational algebra. Thus, it might appear 
that the folk theorem about recursion and SQL is an unproven statement. 

Fortunately, this is not the case: the statement was (partially) proved in the 
past few years; in fact, a series of papers proved progressively stronger results, 
finally establishing good bounds on the expressiveness of SQL. 

My main goal here is twofold: 

(a) I give an overview of these recent results on the expressiveness of SQL. We 
shall see that some tiny differences in the language definition affect the results 
in a dramatic way: under some assumptions, it can be shown that reachability 
and many other recursive queries aren’t expressible in SQL. However, under 
a slightly different set of assumptions, the problem of proving expressivity 
bounds for SQL is as hard as separating some complexity classes. 

(b) Due to a variety of reasons, even the simplest proofs of expressivity results 
for SQL are not easy to follow; partly this is due to the fact that most papers 
used the setting of their predecessors that had unnecessary complications in 
the form of nested relations, somewhat unusual (for mainstream database 
people) languages and infinitary logics. Here I try to get rid of those compli- 
cations, and present a simple and self-contained proof of expressivity bounds 
for SQL. 

Organization. In the next section, we discuss the main features that distin- 
guish SQL from relational algebra, in particular, aggregate functions. We then 
give a brief overview of the literature on the expressive power of SQL. 

Starting with Section 3, we present those results in more detail. We intro- 
duce relational algebra with grouping and aggregates, ALGaggr, that essentially 
captures basic SQL statements. Section 4 states the main result on the expres- 
sive power of SQL, namely that queries it can express are local. If one thinks 
of queries on graphs, it means that the decision whether a tuple t belongs to 
the output is determined by a small neighborhood of t in the input graph; the 
reachability query does not have this property. 
Section 5 defines an aggregate logic $\mathcal{L}_{aggr}$ and shows a simple translation 
of the algebra with aggregates ALGaggr into this logic. Then, in Section 6, we 
present a self-contained proof of locality of $\mathcal{L}_{aggr}$ (and thus of ALGaggr). 

In Section 7, we consider an extension $\mathrm{ALG}^{<}_{aggr}$ of ALGaggr in which 
non-numerical order comparisons are allowed, and show that it is more powerful than 
the unordered version. Furthermore, no nontrivial bounds on the expressiveness 
of this language can be proved without answering some deep open problems in 
complexity theory. 

Section 8 gives a summary and concluding remarks. 



2 SQL vs. Relational Algebra 

What exactly is SQL? There is, of course, a very long standard, that lists nu- 
merous features, most of which have very little to do with the expressiveness of 
queries. As far as expressiveness is concerned, the main features that distinguish 
SQL from relational algebra, are the following: 

— Aggregate functions: one can compute, for example, the average value in a 
column. The standard aggregates in SQL are COUNT, TOTAL, AVG, MIN, MAX. 

— Grouping: not only can one compute aggregates, one can also group them 
by values of different attributes. For example, it is possible to compute the 
average salary for each department. 

— Arithmetic: SQL allows one to apply arithmetic operations to numerical 
values. 

For example, for relations S1(Empl, Dept) and S2(Empl, Salary), the following 
query (assuming that Empl is a key for both relations) computes the average 
salary for each department which pays total salary at least 100,000: 

        SELECT S1.Dept, AVG(S2.Salary)
        FROM S1, S2
    (*) WHERE S1.Empl = S2.Empl
        GROUPBY S1.Dept
        HAVING TOTAL(S2.Salary) > 100000
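
A runnable variant of query (*) is sketched below in Python with SQLite. It is only an illustration under stated assumptions: the employee data is invented, and the query is written with the standard spellings `GROUP BY` and `SUM`, where `SUM` plays the role of the aggregate called TOTAL in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S1 (Empl TEXT PRIMARY KEY, Dept TEXT)")
conn.execute("CREATE TABLE S2 (Empl TEXT PRIMARY KEY, Salary INTEGER)")
# Hypothetical data: department "db" pays 110000 in total, "ai" pays 90000.
conn.executemany("INSERT INTO S1 VALUES (?, ?)",
                 [("ann", "db"), ("bob", "db"), ("eve", "ai")])
conn.executemany("INSERT INTO S2 VALUES (?, ?)",
                 [("ann", 60000), ("bob", 50000), ("eve", 90000)])

QUERY = """
SELECT S1.Dept, AVG(S2.Salary)
FROM S1, S2
WHERE S1.Empl = S2.Empl
GROUP BY S1.Dept
HAVING SUM(S2.Salary) > 100000
"""

result = list(conn.execute(QUERY))
print(result)
```

Only the department whose total salary exceeds the threshold survives the `HAVING` clause, and its average is reported.
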

Next, we address the following question: what is an aggregate function? The 
first paper to look into this was probably [20]: it defined aggregate functions 
as $f : \mathcal{R} \to$ Num, where $\mathcal{R}$ is the set of all relations, and Num is a numerical 
domain. A problem with this approach is that it requires a different aggregate 
function for each relation and each numerical attribute in it; that is, we do not 
have just one aggregate AVG, but infinitely many of those. This complication 
arises from dealing with duplicates in a correct manner. However, duplicates can 
be incorporated in a much more elegant way, as suggested in [14], which we shall 
follow here. According to [14], an aggregate function $\mathcal{F}$ is a collection 

$$\mathcal{F} = \{f_0, f_1, f_2, \ldots, f_\omega\}$$

where $f_k$ is a function that takes a $k$-element multiset (bag) of elements of Num 
and produces an element of Num. For technical reasons, we also add a constant 
$f_\omega \in$ Num whose intended meaning is the value of $\mathcal{F}$ on infinite multisets. For 
example, if Num is N, or Q, or R, we define the aggregate $\Sigma = \{s_0, s_1, \ldots\}$ by 
$s_k(\{|x_1, \ldots, x_k|\}) = \sum_{i=1}^{k} x_i$; furthermore, $s_0 = s_\omega = 0$ (we use the {| |} brackets 
for multisets). This corresponds to SQL’s TOTAL. For COUNT, one defines $C = 
\{c_0, c_1, \ldots\}$ with $c_k$ returning $k$ (we may again assume $c_\omega = 0$). The aggregate 
AVG is defined as $A = \{a_0, a_1, \ldots\}$ with $a_k(X) = s_k(X)/c_k(X)$, $a_0 = a_\omega = 0$. 
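
To make the definition concrete, here is a small Python sketch of the three aggregates as functions on finite bags. It assumes bags are represented as lists (so duplicates are kept) and uses exact rationals; the constants for infinite multisets are not modelled, since a program only ever sees finite bags.

```python
from fractions import Fraction

def total(bag):
    """s_k: the sum of a finite bag; the empty bag gives s_0 = 0."""
    return sum(bag, Fraction(0))

def count(bag):
    """c_k: the number of elements of the bag, duplicates included."""
    return Fraction(len(bag))

def avg(bag):
    """a_k(X) = s_k(X) / c_k(X), with a_0 = 0 by convention."""
    return total(bag) / count(bag) if bag else Fraction(0)

# The bag {|1, 2, 2|}: the duplicate 2 counts twice.
print(total([1, 2, 2]), count([1, 2, 2]), avg([1, 2, 2]))
```

A single function per aggregate works for bags of every size, which is the point of defining an aggregate as a family indexed by the size of the bag.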



Languages That Model SQL and Their Expressive Power 

It is very hard to prove formal statements about a language like SQL: to put 
it mildly, its syntax is not very easy to reason about. The research community 
has come up with several proposals of languages that capture the expressiveness 
of SQL. The earliest one is perhaps Klug’s extension of relational algebra by 
grouping and aggregation [20]: if $e$ is an expression producing a relation with $m$ 
attributes, $A$ is a set of attributes, and $f$ is an aggregate function, then $e\langle A, f\rangle$ 
is a new expression that produces a relation with $m + 1$ attributes. Assuming $f$ 
applies to attribute $A'$, and $B$ is the list of all attributes of the output of $e$, the 
semantics is best explained by SQL: 

    SELECT B, f(A')
    FROM e
    GROUPBY A

Klug’s paper did not analyze the expressive power of this algebra, nor did it show 
how to incorporate arithmetic operations. The main contribution of [20] is an 
equivalence result between the algebra and an extension of relational calculus. 
However, the main focus of that extension is its safety, and the resulting logic is 
extremely hard to deal with, due to many syntactic restrictions. 

To the best of my knowledge, the first paper that directly addressed the pro- 
blem of the expressive power of SQL, was the paper by Consens and Mendelzon 
in ICDT’90 [6]. They have a datalog-like language, whose nonrecursive fragment 
is exactly as expressive as Klug’s algebra. Then they show that this language 
cannot express the transitive closure query under the assumption that DLOGSPACE 
is properly included in NLOGSPACE. The reason is simple: Klug’s algebra 
(with some simple aggregates) can be evaluated in DLOGSPACE, while 
transitive closure is complete for NLOGSPACE. 

That result can be viewed as strong evidence that SQL is indeed incapable 
of expressing reachability queries. However, it is not completely satisfactory for 
three reasons. First, nobody knows how to separate complexity classes. Second, 
what if one adds more complex aggregates that increase the complexity of query 
evaluation? And third, what if the input graph has a very simple structure (for 
example, no node has outdegree more than 1)? In this case reachability is in 
DLOGSPACE, and the argument of [6] does not work. 

In the early 90s, many people were looking into languages for collection types. 
Functional statically typechecked query languages became quite fashionable, and 
they were produced in all kinds of flavors, depending on particular collection types 
they had to support. It turned out that a set language capturing essentially 
the expressive power of a language for bags, could also model all the essential 
features of SQL [23] . The problem was that the language dealt with nested relati- 
ons, or complex objects. But then [23] proved a conservativity result, stating that 
nested relations aren’t really needed if the input and output don’t have them. 
That made it possible to use a non-nested fragment of languages inspired by 
structural recursion [4] and comprehensions [28] as a “theoretical reconstruction 
of SQL.” 

Several papers dealt with this language, and proved a number of expressivity 
bounds. The first one, appearing in PODS’94 [23], showed that the language 
could not express reachability queries. The proof, however, was very far from 
ideal. It only proved inexpressibility of transitive closure in a way that was very 
unlikely to extend to other queries. It relied on a complicated syntactic rewriting 
that wouldn’t work even for a slightly different language. And the proof wouldn’t 
work if one added more aggregate functions. 

The first limitation was addressed in [8] where a certain general property of 
queries expressible in SQL was established. However, the other two problems not 
only remained, but were exacerbated: the rewriting of queries became particu- 
larly unpleasant. In an attempt to remedy this, [21] gave an indirect encoding of 
a fragment of SQL into first-order logic with counting, FO(C) (it will be formally 
defined later) . The restriction was to natural numbers, thus excluding aggregates 
such as AVG. The encoding is bound to be indirect, since SQL is capable of ex- 
pressing queries that FO(C) cannot express. The encoding showed that for any 
query Q in SQL, there exists a FO(C) query Q' that shares some nice properties 
with Q. Then [21] established some properties of FO(C) queries and transferred 
them to that fragment of SQL. The proof was much cleaner than the proofs of 
[23,8], at the expense of a less expressive language. 

After that, [24] showed that the coding technique can be extended to SQL 
with rational numbers and the usual arithmetic operations. The price to pay was 
the readability of the proof - the encoding part became very unpleasant. 

That was a good time to pause and see what must be done differently. How 
do we prove expressivity bounds for relational algebra? We do it by proving bo- 
unds on the expressiveness of first-order logic (FO) over finite structures, since 
relational algebra has the same power as FO. So perhaps if we could put aggre- 
gates and arithmetic directly into logic, we would be able to prove expressivity 
bounds in a nice and simple way? 

That program was carried out in [18], and I’ll survey the results below. One 
problem with [18] is that it inherited too much unnecessary machinery from its 
predecessors [23,8,24,21,22]: one had to deal with languages for complex objects 
and apply conservativity results to get down to SQL; logics were infinitary to 
start with, although infinitary connectives were not necessary to translate SQL; 
and expressivity proofs went via a special kind of games invented elsewhere [16]. 

Here we show that all these complications are completely unnecessary: there 
is indeed a very simple proof that reachability is not expressible in SQL, and 
this proof will be presented below. Our language is a slight extension of Klug’s 
algebra (no nesting!). We translate it into an aggregate logic (with no infinitary 
connectives!) and prove that it has nice locality properties (without using games!). 

3 Relational Algebra with Aggregates 

To deal with aggregation, we must distinguish numerical columns (to which 
aggregates can be applied) from non-numerical ones. We do it by typing: a type 
of a relation is simply a list of types of its attributes. 

We assume that there are two base types: a non-numerical type b with domain 
Dom, and a numerical type n, whose domain is denoted by Num (it could be 
N, Z, Q, R, for example). 

A type of a relation is a string over the alphabet {b, n}. A relation $R$ of 
type $a_1 \ldots a_m$ has $m$ columns, the $i$th one containing entries of type $a_i$. In other 
words, such a relation is a finite subset of 

$$\prod_{i=1}^{m} \mathrm{dom}(a_i)$$

where dom(b) = Dom and dom(n) = Num. For example, the type of S2(Empl, 
Salary) is bn. For a type $t$, $t.i$ denotes the $i$th position in the string. The length 
of $t$ is denoted by $|t|$. 

A database schema SC is a collection of relation names $R_i$ and their types 
$t_i$; we write $R_i : t_i$ if the type of $R_i$ is $t_i$. 

Next we define expressions of relational algebra with aggregates, parameterized 
by a collection $\Omega$ of functions and predicates on Num, and a collection $\Theta$ of 
aggregates, over a given schema SC. Expressions are divided into three groups: 
the standard relational algebra, arithmetic, and aggregation/grouping. In what 
follows, $m$ stands for $|t|$, and $\{i_1, \ldots, i_k\}$ for a sequence $1 \le i_1 < \ldots < i_k \le m$. 

Relational Algebra 

Schema Relation If $R : t$ is in SC, then $R$ is an expression of type $t$. 

Permutation If $e$ is an expression of type $t$ and $\theta$ is a permutation of $\{1, \ldots, m\}$, 
then $\rho_\theta(e)$ is an expression of type $\theta(t)$. 

Boolean Operations If $e_1, e_2$ are expressions of type $t$, then so are $e_1 \cup e_2$, 
$e_1 \cap e_2$, $e_1 - e_2$. 

Cartesian Product For $e_1 : t_1$, $e_2 : t_2$, $e_1 \times e_2$ is an expression of type $t_1 \cdot t_2$. 

Projection If $e$ is of type $t$, then $\pi_{i_1, \ldots, i_k}(e)$ is an expression of type $t'$ where 
$t'$ is the string composed of the $t.i_j$s, in their order. 

Selection If $e$ is an expression of type $t$, $i, j \le m$, and $t.i = t.j$, then $\sigma_{i=j}(e)$ 
is an expression of type $t$. 

Arithmetic 

Numerical Selection If $P \subseteq \mathrm{Num}^k$ is a $k$-ary numerical predicate from $\Omega$, 
and $i_1, \ldots, i_k$ are such that $t.i_j$ = n, then $\sigma[P]_{i_1, \ldots, i_k}(e)$ is an expression of 
type $t$ for any expression $e$ of type $t$. 

Function Application If $f : \mathrm{Num}^k \to \mathrm{Num}$ is a function from $\Omega$, $i_1, \ldots, i_k$ 
are such that $t.i_j$ = n, and $e$ is an expression of type $t$, then $\mathrm{Apply}[f]_{i_1, \ldots, i_k}(e)$ 
is an expression of type $t \cdot$ n. If $k = 0$ (i.e. $f$ is a constant), then $\mathrm{Apply}[f]_{\emptyset}(e)$ 
is an expression of type $t \cdot$ n. 

Aggregation and Grouping 

Aggregation Let $\mathcal{F}$ be an aggregate from $\Theta$. For any expression $e$ of type $t$ 
and $i$ such that $t.i$ = n, $\mathrm{Aggr}[i : \mathcal{F}](e)$ is an expression of type $t \cdot$ n. 

Grouping Assume $e : u$ is an expression over SC $\cup \{S : s\}$. Let $e'$ be an 
expression of type $t \cdot s$ over SC, where $|t| = l$. Then $\mathrm{Group}_l[\lambda S.e](e')$ is an 
expression of type $t \cdot u$. 



Semantics. For the relational algebra operations, this is standard. The operation 
$\rho_\theta$ is permutation: each tuple $(a_1, \ldots, a_m)$ is replaced by $(a_{\theta(1)}, \ldots, a_{\theta(m)})$. 
The condition $i = j$ in the selection predicate means equality of the $i$th and 
the $j$th attribute: $(a_1, \ldots, a_m)$ is selected if $a_i = a_j$. Note that using Boolean 
operations we can model arbitrary combinations of equalities and disequalities 
among attributes. 

For numerical selection, $\sigma[P]_{i_1, \ldots, i_k}$ selects $(a_1, \ldots, a_m)$ iff $P(a_{i_1}, \ldots, a_{i_k})$ 
holds. Function application replaces each $(a_1, \ldots, a_m)$ with 
$(a_1, \ldots, a_m, f(a_{i_1}, \ldots, a_{i_k}))$. 

The aggregate operation is SQL SELECT $A, \mathcal{F}(A_i)$ FROM $e$, where $A = (A_1, \ldots, A_m)$ 
is the list of attributes. More precisely, if $e$ evaluates to $\vec a_1, \ldots, \vec a_p$ where 
$\vec a_j = (a_j^1, \ldots, a_j^m)$, then $\mathrm{Aggr}[i : \mathcal{F}](e)$ replaces each $\vec a_j$ with $(a_j^1, \ldots, a_j^m, f)$ where 
$f = f_p(\{|a_1^i, \ldots, a_p^i|\})$. 

Finally, $\mathrm{Group}_l[\lambda S.e](e')$ groups the tuples by the values of their first $l$ attributes 
and applies $e$ to the sets formed by this grouping. For example: 

    a1  b1
    a1  b2           a1 | {b1, b2}                a1  d1
    a2  c1    -->    a2 | {c1, c2}    --λS.e-->   a1  d2
    a2  c2                                        a2  g1

assuming that $e$ returns {d1, d2} when $S$ = {b1, b2}, and $e$ returns {g1} for 
$S$ = {c1, c2}. 

Formally, let $e'$ evaluate to $\{\vec a_1, \ldots, \vec a_p\}$. We split each tuple $\vec a_j = (a_j^1, \ldots, a_j^m)$ 
into $\vec a'_j = (a_j^1, \ldots, a_j^l)$ that contains the first $l$ attributes, and $\vec a''_j = (a_j^{l+1}, \ldots, a_j^m)$ 
that contains the remaining ones. This defines, for each $\vec a_j$, a set $S_j = \{\vec a''_r \mid \vec a'_r = \vec a'_j\}$. 
Let $T_j = \{\vec b_j^1, \ldots, \vec b_j^{m_j}\}$ be the result of applying $e$ with $S$ interpreted as $S_j$. 

Then $\mathrm{Group}_l[\lambda S.e](e')$ returns the set of tuples of the form $(\vec a'_j, \vec b_j^i)$, $1 \le j \le p$, 
$1 \le i \le m_j$. 
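
The semantics of the two new operators can be sketched in a few lines of Python. This is an illustration under assumptions of our own: relations are Python sets of tuples, column positions are 1-based as in the text, an aggregate is any function on a list (bag) of numbers, and the names `aggr` and `group` are hypothetical.

```python
def aggr(rel, i, agg):
    """Aggr[i : F](e): append to every tuple the value of F on the bag
    of i-th column entries (one entry per tuple of the relation)."""
    value = agg([row[i - 1] for row in rel])
    return {row + (value,) for row in rel}

def group(rel, l, e):
    """Group_l[lambda S.e](e'): split tuples after position l, collect the
    tails by common prefix, apply e to each set of tails, and pair each
    prefix with every tuple of the corresponding result."""
    groups = {}
    for row in rel:
        groups.setdefault(row[:l], set()).add(row[l:])
    return {prefix + out for prefix, s in groups.items() for out in e(s)}

# The grouping example from the text, with e fixed on the two groups.
def e(s):
    if s == {("b1",), ("b2",)}:
        return {("d1",), ("d2",)}
    return {("g1",)}

rel = {("a1", "b1"), ("a1", "b2"), ("a2", "c1"), ("a2", "c2")}
grouped = group(rel, 1, e)
print(sorted(grouped))
```

Passing `e` as a Python function mirrors the higher-order flavor of the operator: the expression $e$ is evaluated once per group with $S$ bound to that group.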

Klug’s algebra. It combines grouping and aggregation in the same operation 
as follows: 
Grouping & Aggregation Let $t$ be of length $m$. Let $l < i_1 < \ldots < i_k$ with 
$t.i_j$ = n, and let $\mathcal{F}_1, \ldots, \mathcal{F}_k$ be aggregates from $\Theta$. Then, for $e$ an expression 
of type $t$, $\mathrm{Aggr}_l[i_1 : \mathcal{F}_1, \ldots, i_k : \mathcal{F}_k](e)$ is an expression of type $t \cdot$ n...n ($t$ with 
$k$ n's added at the end). 

The semantics is best explained by SQL: 

    SELECT #1, ..., #m, F1(#i1), ..., Fk(#ik)
    FROM E
    GROUPBY #1, ..., #l

where E is the result of the expression e. (As presented in [20] , the algebra does 
not have arithmetic operations, and the aggregates are limited to the standard 
five.) 

Note that there are no higher-order operators in Klug’s algebra, and that it 
is expressible in our algebra with aggregates, as $\mathrm{Aggr}_l[i_1 : \mathcal{F}_1, \ldots, i_k : \mathcal{F}_k](e')$ is 
equivalent to $\mathrm{Group}_l[\lambda S.e](e')$, where $e$ is 

$$\mathrm{Aggr}[i_k - l : \mathcal{F}_k](\mathrm{Aggr}[i_{k-1} - l : \mathcal{F}_{k-1}](\cdots(\mathrm{Aggr}[i_1 - l : \mathcal{F}_1](S))\cdots))$$

Example. The query (*) from Section 2 is defined by the following expression 
(which uses the operator combining grouping with aggregation): 

$$\pi_{1,4}(\sigma[> 100000]_5(\mathrm{Aggr}_1[3 : A, 3 : \Sigma](\pi_{2,3,4}(\sigma_{1=3}(S_1 \times S_2)))))$$

where $A$ is the aggregate AVG, $\Sigma$ is TOTAL, and > 100000 is a unary predicate 
on N which holds of numbers n > 100000. 

Example. The only aggregate that can be applied to non-numerical attributes in 
SQL is COUNT that returns the cardinality of a column. It can be easily expressed 
in ALGaggr as long as the summation aggregate $\Sigma$ and constant 1 are present. 
We show how to define $\mathrm{Count}_m(e)$: 

    SELECT #1, ..., #(m-1), COUNT(#m)
    FROM E
    GROUPBY #1, ..., #(m-1)

First, we add a new column, whose elements are all 1s: $e_1 = \mathrm{Apply}[1]_{\emptyset}(e)$. 
Then define an expression $e' = \mathrm{Aggr}[2 : \Sigma](S)$, and use it to produce 

$$e_2 = \mathrm{Group}_{m-1}[\lambda S.e'](e_1).$$

This is almost the answer: there are 2 extra attributes, the $m$th attribute of $e$, 
and those extra 1s. So finally we have 

$$\mathrm{Count}_m(e) = \pi_{1,\ldots,m-1,m+2}(\mathrm{Group}_{m-1}[\lambda S.\mathrm{Aggr}[2 : \Sigma](S)](\mathrm{Apply}[1]_{\emptyset}(e)))$$
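
The construction of the counting expression can be replayed step by step in Python. The sketch below assumes relations are sets of tuples with 1-based column positions; the helper names (`apply_const`, `aggr`, `group`, `project`, `count_last`) and the sample relation are hypothetical and serve only this illustration.

```python
def apply_const(rel, c):
    """Apply with a constant function: append the constant c to every tuple."""
    return {row + (c,) for row in rel}

def aggr(rel, i, agg):
    """Append the aggregate of the bag of i-th column entries to every tuple."""
    value = agg([row[i - 1] for row in rel])
    return {row + (value,) for row in rel}

def group(rel, l, e):
    """Group by the first l columns and apply e to each set of tails."""
    groups = {}
    for row in rel:
        groups.setdefault(row[:l], set()).add(row[l:])
    return {prefix + out for prefix, s in groups.items() for out in e(s)}

def project(rel, cols):
    """Keep the listed columns, in the given order."""
    return {tuple(row[c - 1] for c in cols) for row in rel}

def count_last(rel, m):
    """Count of the m-th column per group of the first m-1 columns."""
    e1 = apply_const(rel, 1)                          # add a column of 1s
    e2 = group(e1, m - 1, lambda s: aggr(s, 2, sum))  # sum the 1s per group
    return project(e2, list(range(1, m)) + [m + 2])   # drop the 2 extra columns

# Hypothetical relation (Dept, Empl): count employees per department.
rel = {("db", "ann"), ("db", "bob"), ("ai", "eve")}
counts = count_last(rel, 2)
print(sorted(counts))
```

Within each group the tail has two columns (the original last attribute and the constant 1), which is why the summation is applied to position 2.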

Remark In previous papers on the expressive power of SQL [23,24,21,18], we 
used languages of a rather different flavor, based on structural recursion [4] and 
comprehensions [28]. One can show, however, that those languages and ALGaggr 
have the same expressiveness, provided they are supplied with the same set of 
aggregates and arithmetic functions. The proof of this will be given in the full 
version. 
4 Locality of SQL Queries 

What kind of general statement can one provide that would give us strong evi- 
dence that SQL cannot express recursive queries? For that purpose, we shall use 
the locality of queries. Locality was the basis of a number of tools for proving 
expressivity bounds of first-order logic [15,13,11], and it was recently studied on 
its own and applied to more expressive logics [17,22]. 







Fig. 1. A local formula cannot distinguish (a, b) from (b, a). 



The general idea of this notion is that a query can only look at a small 
portion of its input. If the input is a graph, “small” means a neighborhood of a 
fixed radius. For example, Fig. 1 shows that reachability is not local: just take a 
graph like the one shown in the picture so that there would be two points whose 
distance from the endpoints and each other is more than 2r, where r is the fixed 
radius. Then locality of the query says that (a, b) and (b, a) are indistinguishable, as 
the query can only look at the r-neighborhoods of a and b. Transitive closure, 
on the other hand, does distinguish between (a, b) and (b,a), since b is reachable 
from a but not vice versa. 

We now define locality formally. We say that a schema SC is purely relational 
if there are no occurrences of the numerical type n in it. Let us first restrict 
our attention to graph queries. Suppose we have a purely relational schema 
R : bb; that is, the relation R contains edges of a directed graph. Suppose e is an 
expression of the same type bb; that is, it returns a directed graph. Given a pair 
of nodes $a, b$ in $R$, and a number $r > 0$, the $r$-neighborhood of $a, b$ in $R$, $N_r^R(a, b)$, 
is the subgraph on the set of nodes in $R$ whose distance from either $a$ or $b$ is at 
most $r$. The distance is measured in the undirected graph corresponding to $R$, 
that is, $R \cup R^{-1}$. 

We write $(a, b) \approx_r^R (c, d)$ when the two neighborhoods, $N_r^R(a, b)$ and $N_r^R(c, d)$, 
are isomorphic; that is, when there exists a (graph) isomorphism $h$ between them 
such that $h(a) = c$, $h(b) = d$. Finally, we say that $e$ is local if there is a number 
$r$, depending on $e$ only, such that 

$$(a, b) \approx_r^R (c, d) \quad \text{implies} \quad (a, b) \in e(R) \text{ iff } (c, d) \in e(R).$$
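
The definitions can be tested on a small instance of the chain from Fig. 1. The Python sketch below is a brute-force illustration under our own assumptions: graphs are sets of directed edges over integer nodes, the isomorphism test tries all bijections (feasible only for tiny neighborhoods), and the chain length and the radius r = 1 are chosen arbitrarily.

```python
from itertools import permutations

def neighborhood(edges, points, r):
    """Nodes within undirected distance r of the given points, together
    with the edges of the graph induced on those nodes."""
    near = set(points)
    for _ in range(r):
        near |= {v for u, v in edges if u in near} | {u for u, v in edges if v in near}
    return near, {(u, v) for u, v in edges if u in near and v in near}

def isomorphic(edges, r, pair1, pair2):
    """Is there an isomorphism of the r-neighborhoods mapping pair1 to pair2?"""
    nodes1, e1 = neighborhood(edges, pair1, r)
    nodes2, e2 = neighborhood(edges, pair2, r)
    if len(nodes1) != len(nodes2):
        return False
    order = sorted(nodes1)
    for image in permutations(sorted(nodes2)):
        h = dict(zip(order, image))
        if (h[pair1[0]], h[pair1[1]]) == pair2 and {(h[u], h[v]) for u, v in e1} == e2:
            return True
    return False

def reachable(edges, a, b):
    """Is there a directed path from a to b (a and b assumed distinct)?"""
    seen, frontier = set(), {a}
    while frontier:
        seen |= frontier
        frontier = {v for u, v in edges if u in frontier} - seen
    return b in seen

# Chain 0 -> 1 -> ... -> 10; a = 3 and b = 7 are more than 2r apart
# from each other and from both endpoints when r = 1.
chain = {(i, i + 1) for i in range(10)}
a, b, r = 3, 7, 1
print(isomorphic(chain, r, (a, b), (b, a)), reachable(chain, a, b), reachable(chain, b, a))
```

The two pairs have isomorphic neighborhoods, so a query that is local with radius 1 must treat them alike, yet reachability holds for (a, b) and fails for (b, a).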

We have seen that reachability is not local. Another example of a non-local 
query is a typical example of recursive query called same-generation: 

    sg(x,x) :-
    sg(x,y) :- R(x',x), R(y',y), sg(x',y')

This query is not local either: consider, for example, a graph consisting of two 
chains: $(a, b_1), (b_1, b_2), \ldots, (b_{m-1}, b_m)$ and $(a, c_1), (c_1, c_2), \ldots, (c_{m-1}, c_m)$. Assume 
that same-generation is local, and $r > 0$ witnesses that. Take $m > 2r + 3$, and 
note that the $r$-neighborhoods of $(b_{r+1}, c_{r+1})$ and $(b_{r+1}, c_{r+2})$ are isomorphic. By 
locality, this would imply that these pairs agree on the same-generation query, 
but in fact we have $(b_{r+1}, c_{r+1}) \in sg(R)$ and $(b_{r+1}, c_{r+2}) \notin sg(R)$. 
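
The counterexample is easy to reproduce. The following Python sketch evaluates the two same-generation rules by naive fixpoint iteration on the two-chain graph; the node naming scheme and the choice r = 1, m = 6 are assumptions made for this example.

```python
def same_generation(edges):
    """Naive bottom-up evaluation of the two sg rules."""
    nodes = {x for edge in edges for x in edge}
    sg = {(x, x) for x in nodes}                 # sg(x,x) :-
    while True:
        # sg(x,y) :- R(x',x), R(y',y), sg(x',y')
        new = {(x, y) for xp, x in edges for yp, y in edges if (xp, yp) in sg}
        if new <= sg:
            return sg
        sg |= new

def two_chains(m):
    """Edges (a,b1),(b1,b2),...,(b_{m-1},b_m) and the same for the c chain."""
    edges = set()
    for name in ("b", "c"):
        prev = "a"
        for i in range(1, m + 1):
            edges.add((prev, f"{name}{i}"))
            prev = f"{name}{i}"
    return edges

r = 1
m = 2 * r + 4                                    # any m > 2r + 3 works
sg = same_generation(two_chains(m))
print((f"b{r+1}", f"c{r+1}") in sg, (f"b{r+1}", f"c{r+2}") in sg)
```

The pair at equal depth is in the answer and the pair at depths r+1 and r+2 is not, although both pairs lie too far from the root and from the chain ends for an r-neighborhood to tell them apart.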

We now state our main result on locality of queries, that applies to the 
language in which no limit is placed on the available arithmetic and aggregate 
functions - all are available. We denote this language by ALGaggr(All, All). 

Theorem 1 (Locality of SQL). Let e be a pure relational graph query in 
ALGaggr(All, All), that is, an expression of type bb over the scheme of one symbol 
R : bb. Then e is local. 

That is, neither reachability, nor same-generation, is expressible in SQL over 
the base type b, no matter what aggregate functions and arithmetic operations 
are available. Inexpressibility of many other queries can be derived from this, for 
example, tests for graph connectivity and acyclicity. 

Our next goal is to give an elementary, self-contained proof of this result. The 
restriction to graph queries used in the theorem is not necessary; the result can 
be stated in greater generality, but the restriction to graphs makes the definition 
of locality very easy to understand. The proof will consist of three steps: 

1. It is easier to prove expressivity bounds for a logic than for an algebra. We 
introduce an aggregate logic $\mathcal{L}_{aggr}$, as an extension of first-order logic, and 
show how ALGaggr queries are translated into it. 

2. The logic $\mathcal{L}_{aggr}$ is still a bit hard to deal with, because of the aggregate 
terms. We show that we can replace aggregate terms by counting quantifiers, 
thereby translating $\mathcal{L}_{aggr}$ into a simpler logic $\mathcal{L}_C$. The price to pay is that 
$\mathcal{L}_C$ has infinitary connectives. 

3. We note that any use of an infinitary connective resulting from translation 
of $\mathcal{L}_{aggr}$ into $\mathcal{L}_C$ applies to a rather uniform family of formulae, and use this 
fact to give a simple inductive proof of locality of $\mathcal{L}_C$ formulae. 

5 Aggregate Logic and Relational Algebra 

Our goal here is to introduce a logic $\mathcal{L}_{aggr}$ into which we translate ALGaggr 
expressions. The structures for this logic are precisely relational databases over two 
base types with domains Dom and Num; that is, vocabularies are just schemas. 
This makes the logic two-sorted; we shall also refer to Dom as first-sort and to 
Num as second-sort. 

We now define formulae and terms of $\mathcal{L}_{aggr}(\Omega, \Theta)$; as before, $\Omega$ is a set of 
predicates and functions on Num, and $\Theta$ is a set of aggregates. The logic is just 
a slight extension of the two-sorted first-order logic. 

An SC-structure $D$ is a tuple $\langle A, R_1^D, \ldots, R_l^D \rangle$, where $A$ is a finite subset of 
Dom, and $R_i^D$ is a finite subset of 

$$\prod_{j=1}^{|t_i|} \mathrm{dom}_j(D)$$

where $\mathrm{dom}_j(D) = A$ for $t_i.j$ = b, and $\mathrm{dom}_j(D)$ = Num for $t_i.j$ = n. 

- A variable of sort $i$ is a term of sort $i$, $i = 1, 2$. 

- If $R : t$ is in SC, and $\vec u$ is a tuple of terms of type $t$, then $R(\vec u)$ is a formula. 

- Formulae are closed under the Boolean connectives $\vee, \wedge, \neg$ and quantification 
(respecting sorts). If $x$ is a first-sort variable, $\exists x$ is interpreted as $\exists x \in A$; if 
$k$ is a second-sort variable, then $\exists k$ is interpreted as $\exists k \in$ Num. 

- If $P$ is an $n$-ary predicate in $\Omega$ and $\tau_1, \ldots, \tau_n$ are second-sort terms, then 
$P(\tau_1, \ldots, \tau_n)$ is a formula. 

- If $f$ is an $n$-ary function in $\Omega$ and $\tau_1, \ldots, \tau_n$ are second-sort terms, then 
$f(\tau_1, \ldots, \tau_n)$ is a second-sort term. 

- If $\mathcal{F}$ is an aggregate in $\Theta$, $\varphi(\vec x, \vec y)$ is a formula and $\tau(\vec x, \vec y)$ a second-sort 
term, then $\tau'(\vec x) = \mathrm{Aggr}_{\mathcal{F}}\, \vec y.\, (\varphi(\vec x, \vec y), \tau(\vec x, \vec y))$ is a second-sort term with free 
variables $\vec x$. 

The interpretation of all the constructs except the last one is completely
standard. The interpretation of the aggregate term-former is as follows: fix an
interpretation $\bar{a}$ for $\bar{x}$, and let $B = \{\bar{b} \mid D \models \varphi(\bar{a}, \bar{b})\}$. If $B$ is infinite, then $\tau'(\bar{a})$
is $f_\omega$. If $B$ is finite, say $\{\bar{b}_1, \ldots, \bar{b}_l\}$, then $\tau'(\bar{a})$ is the result of applying $f_l$ to the
multiset whose elements are $\tau(\bar{a}, \bar{b}_i)$, $i = 1, \ldots, l$.
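
The interpretation of the aggregate term-former can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the candidate set for the witnesses is finite, so the infinite case (value $f_\omega$) never arises, and the names `eval_aggregate_term`, `total`, `edges` and `targets` are hypothetical, not taken from the paper.

```python
from collections import Counter

def eval_aggregate_term(aggregate, phi, tau, a, candidates):
    """Value of Aggr_F y. (phi(a, y), tau(a, y)) over a finite candidate set.

    aggregate:  function from a Counter (a multiset of numbers) to a number
    phi:        predicate phi(a, b) -> bool
    tau:        numeric term tau(a, b) -> number
    candidates: finite iterable of possible witnesses b (an assumption)
    """
    # B = { b | D |= phi(a, b) }; finite because candidates is finite.
    witnesses = [b for b in candidates if phi(a, b)]
    # A multiset keeps duplicates: distinct witnesses may yield equal values.
    values = Counter(tau(a, b) for b in witnesses)
    return aggregate(values)

def total(multiset):
    # SUM aggregate: every value is counted with its multiplicity.
    return sum(value * count for value, count in multiset.items())

# Hypothetical database: weighted edges (source, target, weight).
edges = {("a", "b", 3), ("a", "c", 3), ("b", "c", 5)}
targets = {(dst, w) for (_, dst, w) in edges}

out_weight = eval_aggregate_term(
    total,
    phi=lambda a, b: (a, b[0], b[1]) in edges,
    tau=lambda a, b: b[1],
    a="a",
    candidates=targets,
)
print(out_weight)  # 6: the value 3 occurs twice in the multiset
```

Using a multiset rather than a set matters here: the two outgoing edges of `a` carry the same weight, and a set would count it only once.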

It is now possible to translate $ALG_{aggr}$ into $\mathcal{L}_{aggr}$.

Theorem 2. Let $e : t$ be an expression of $ALG_{aggr}(\Omega, \Theta)$. Then there is a formula
$\varphi_e(\bar{x})$ of $\mathcal{L}_{aggr}(\Omega, \Theta)$, with $\bar{x}$ of type $t$, such that for any SC-database $D$,

$$e(D) = \{\bar{a} \mid D \models \varphi_e(\bar{a})\}$$

Proof. For the usual relational algebra operators, this is the same as the standard
textbook translation of algebra expressions into calculus expressions. So we only
show how to translate arithmetic operations, aggregation, and grouping.

- Numerical selection: Let $e' = \sigma[P]_{i_1,\ldots,i_k}(e)$, where $P$ is a $k$-ary predicate in
$\Omega$. Then $\varphi_{e'}(\bar{x})$ is defined as $\varphi_e(\bar{x}) \land P(x_{i_1}, \ldots, x_{i_k})$.

- Function application: Let $e' = Apply[f]_{i_1,\ldots,i_k}(e)$, where $f : Num^k \to Num$ is
in $\Omega$. Then $\varphi_{e'}(\bar{x}, q) \equiv \varphi_e(\bar{x}) \land (q = f(x_{i_1}, \ldots, x_{i_k}))$.

- Aggregation: Let $e' = Aggr[i : F](e)$. Then $\varphi_{e'}(\bar{x}, q) \equiv \varphi_e(\bar{x}) \land (q =
Aggr_F \bar{y}.\ (\varphi_e(\bar{y}), y_i))$.

- Grouping: Let $e' = Group_m[\lambda S.e_1](e_2)$, where $e_1 : u$ is an expression over
SC $\cup \{S\}$, and $e_2$ over SC is of type $t \cdot s$. Let $\bar{x}, \bar{y}, \bar{z}$ be of types $t, s, u$,
respectively. Then

$$\varphi_{e'}(\bar{x}, \bar{z}) \equiv \exists \bar{y}\, \varphi_{e_2}(\bar{x}, \bar{y}) \land \varphi_{e_1}(\bar{z})[\varphi_{e_2}(\bar{x}, \bar{v})/S(\bar{v})]$$

where the second conjunct is $\varphi_{e_1}(\bar{z})$ in which every occurrence of $S(\bar{v})$ is
replaced by $\varphi_{e_2}(\bar{x}, \bar{v})$.
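
To make the first three cases concrete, here is a minimal Python sketch of the algebra operators acting on relations represented as sets of tuples. The function names `select_num`, `apply_fn` and `aggr`, the 0-based column indices, and the sample relation are illustrative assumptions; grouping is omitted.

```python
def select_num(relation, pred, cols):
    """sigma[P]_{cols}(e): keep tuples whose chosen numeric columns satisfy P."""
    return {t for t in relation if pred(*(t[i] for i in cols))}

def apply_fn(relation, fn, cols):
    """Apply[f]_{cols}(e): append f of the chosen columns to every tuple."""
    return {t + (fn(*(t[i] for i in cols)),) for t in relation}

def aggr(relation, col, aggregate):
    """Aggr[col : F](e): append F of the whole column to every tuple."""
    # The column is a multiset, so it is collected as a list, not a set.
    column = [t[col] for t in relation]
    value = aggregate(column)
    return {t + (value,) for t in relation}

# Hypothetical relation of type b.n.n (columns are 0-indexed here).
r = {("x", 2, 3), ("y", 4, 1)}

print(select_num(r, lambda a, b: a < b, (1, 2)))
print(apply_fn(r, lambda a, b: a + b, (1, 2)))
print(aggr(r, 1, sum))
```

Each function mirrors the corresponding formula: the selection adds a conjunct, function application adds an equation for the new column $q$, and aggregation computes one value from the multiset of column entries and attaches it to every tuple.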

The converse does not hold: formulae of $\mathcal{L}_{aggr}$ need not define safe queries,
while all $ALG_{aggr}$ queries are safe. It is possible, however, to prove a partial
converse result; see [18] for more details.

6 SQL Is Local: The Proof 

We start by stating our main result in greater generality, without restriction to 
graph queries. 

Let SC be pure relational (no occurrences of type n), and $D$ an instance of
SC. The active domain of $D$, $adom(D)$, is the set of all elements of Dom that
occur in relations of $D$. The Gaifman graph of $D$ is the undirected graph $G(D)$ on
$adom(D)$ with $(a, b) \in G(D)$ iff $a, b$ belong to the same tuple of some relation in
$D$. The $r$-sphere of $a \in adom(D)$, $S_r^D(a)$, is the set of all $b$ such that $d(a, b) \le r$,
where the distance $d(\cdot,\cdot)$ is taken in $G(D)$. The $r$-sphere of $\bar{a} = (a_1, \ldots, a_k)$
is $S_r^D(\bar{a}) = \bigcup_{i \le k} S_r^D(a_i)$. The $r$-neighborhood of $\bar{a}$, $N_r^D(\bar{a})$, is a new database,
whose active domain is $S_r^D(\bar{a})$, and whose SC-relations are simply restrictions of
those relations in $D$. We write $\bar{a} \approx_r^D \bar{b}$ when there is an isomorphism of relational
structures $h : N_r^D(\bar{a}) \to N_r^D(\bar{b})$ such that in addition $h(\bar{a}) = \bar{b}$. Finally, we say
that a query $e$ of type $b \ldots b$ is local if there exists a number $r \ge 0$ such that,
for any database $D$, $\bar{a} \approx_r^D \bar{b}$ implies that $\bar{a} \in e(D)$ iff $\bar{b} \in e(D)$. The minimum
such $r$ is called the locality rank of $e$ and denoted by $lr(e)$.
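
The notions of Gaifman graph, $r$-sphere and $r$-neighborhood can be computed directly on small instances. The following Python sketch assumes a database is given as a dictionary from relation names to sets of tuples; the function names and the sample chain graph are hypothetical.

```python
from collections import deque
from itertools import combinations

def gaifman_graph(db):
    """Adjacency sets of the Gaifman graph of db (relation name -> tuples)."""
    adj = {}
    for tuples in db.values():
        for t in tuples:
            for x in t:
                adj.setdefault(x, set())
            # Any two distinct elements of one tuple are adjacent.
            for x, y in combinations(set(t), 2):
                adj[x].add(y)
                adj[y].add(x)
    return adj

def sphere(adj, centers, r):
    """r-sphere of a tuple: elements within distance r of some center."""
    dist = {c: 0 for c in centers}
    queue = deque(centers)
    while queue:
        x = queue.popleft()
        if dist[x] == r:
            continue  # do not expand beyond radius r
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)

def neighborhood(db, centers, r):
    """Restriction of every relation of db to the r-sphere of centers."""
    ball = sphere(gaifman_graph(db), centers, r)
    return {name: {t for t in tuples if set(t) <= ball}
            for name, tuples in db.items()}

# Hypothetical instance: a chain 1 - 2 - 3 - 4 - 5.
db = {"E": {(1, 2), (2, 3), (3, 4), (4, 5)}}
adj = gaifman_graph(db)
print(sphere(adj, [1], 2))
print(neighborhood(db, [3], 1))
```

A multi-source breadth-first search gives the distance to the nearest center, which is exactly what the union of the individual spheres requires. Testing $\bar{a} \approx_r^D \bar{b}$ would additionally need an isomorphism check between two such neighborhoods, which this sketch does not attempt.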

Theorem 3. Let $e$ be a pure relational query in $ALG_{aggr}$(All, All), that is, an
expression of type $b \ldots b$ over a pure relational schema. Then $e$ is local. □

Since $ALG_{aggr}$(All, All) can be translated into $\mathcal{L}_{aggr}$(All, All), we must prove
that the latter is local. The proof of this is in two steps: we first introduce a
simpler counting logic, $\mathcal{L}_C$, and show how to translate $\mathcal{L}_{aggr}$ into it. We then
give a simple proof of locality of $\mathcal{L}_C$.

The logic $\mathcal{L}_C$ is simpler than $\mathcal{L}_{aggr}$ in that it does not have aggregate terms.
There is a price to pay for this - $\mathcal{L}_C$ has infinitary conjunctions and disjunctions.
However, the translation ensures that for each infinite conjunction or disjunction,
there is a uniform bound on the rank of formulae in it (to be defined a bit later),
and this property suffices to establish locality.

Logic $\mathcal{L}_C$. The structures for $\mathcal{L}_C$ are the same as the structures for $\mathcal{L}_{aggr}$. The
only terms are variables (of either sort); in addition, every constant $c \in Num$ is
a term of the second sort.

Atomic formulae are $R(\bar{x})$, where $R \in$ SC, and $\bar{x}$ is a tuple of terms (that is,
variables and perhaps constants from Num) of the appropriate sort, and $x = y$,
where $x, y$ are terms of the same sort.

Formulae are closed under the Boolean connectives, and infinitary connectives:
if $\varphi_i$, $i \in I$, is a collection of formulae, then $\bigvee_{i \in I} \varphi_i$ and $\bigwedge_{i \in I} \varphi_i$ are $\mathcal{L}_C$
formulae. Furthermore, they are closed under both first and second-sort
quantification.

Finally, for every $i \in \mathbb{N}$, there is a quantifier $\exists i$ that binds one first-sort
variable: that is, if $\varphi(x, \bar{y})$ is a formula, then $\exists i x\, \varphi(x, \bar{y})$ is a formula whose
free variables are $\bar{y}$. The semantics is as follows: $D \models \exists i x\, \varphi(x, \bar{a})$ if there are $i$
distinct elements $b_1, \ldots, b_i \in A$ such that $D \models \varphi(b_j, \bar{a})$, $1 \le j \le i$. That is, the
existential quantifier is witnessed by at least $i$ elements. Note that the first-sort
quantification is superfluous as $\exists x \varphi$ is equivalent to $\exists 1 x\, \varphi$.
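
The semantics of the counting quantifier over a finite first-sort domain can be sketched in a few lines of Python. The function name `exists_at_least` and the sample graph are assumptions made for illustration only.

```python
def exists_at_least(i, domain, phi, *params):
    """D |= (exists i x) phi(x, params): at least i distinct witnesses."""
    # set() guarantees that witnesses are counted as distinct elements.
    return sum(1 for b in set(domain) if phi(b, *params)) >= i

# Hypothetical graph: phi(x, a) holds when (a, x) is an edge.
edges = {("a", "b"), ("a", "c"), ("b", "c")}
nodes = {"a", "b", "c"}
has_successor = lambda x, a: (a, x) in edges

print(exists_at_least(2, nodes, has_successor, "a"))  # True: b and c
print(exists_at_least(2, nodes, has_successor, "b"))  # False: only c
# The ordinary existential quantifier is the special case i = 1.
print(exists_at_least(1, nodes, has_successor, "c"))  # False: no successor
```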

We now introduce the notion of a rank of a formula, $rk(\varphi)$, for both $\mathcal{L}_C$ and
$\mathcal{L}_{aggr}$. For $\mathcal{L}_C$, this is the quantifier rank, but the second-sort quantification does
not count:

- For each atomic $\varphi$, $rk(\varphi) = 0$.

- For $\varphi = \bigvee_i \varphi_i$, $rk(\varphi) = \sup_i rk(\varphi_i)$, and likewise for $\bigwedge$.

- $rk(\lnot\varphi) = rk(\varphi)$.

- $rk(\exists i x\, \varphi) = rk(\varphi) + 1$ for $x$ first-sort; $rk(\exists k \varphi) = rk(\varphi)$ for $k$ second-sort.

For $\mathcal{L}_{aggr}$, the definition differs slightly.

- For a variable or a constant term, the rank is 0.

- The rank of an atomic formula is the maximum rank of a term in it.

- $rk(\varphi_1 * \varphi_2) = \max(rk(\varphi_1), rk(\varphi_2))$, for $* \in \{\lor, \land\}$; $rk(\lnot\varphi) = rk(\varphi)$.

- $rk(f(\tau_1, \ldots, \tau_n)) = \max_{1 \le i \le n} rk(\tau_i)$.

- $rk(\exists x \varphi) = rk(\varphi) + 1$ if $x$ is first-sort; $rk(\exists k \varphi) = rk(\varphi)$ if $k$ is second-sort.

- $rk(Aggr_F \bar{y}.\ (\varphi, \tau)) = \max(rk(\varphi), rk(\tau)) + m$, where $m$ is the number of first-sort
variables in $\bar{y}$.
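
The rank for $\mathcal{L}_{aggr}$ is a straightforward structural recursion. The sketch below assumes a hypothetical encoding of terms and formulae as nested Python tuples (the tags `"var"`, `"atom"`, `"aggr"` and so on are illustrative and not part of the paper); sorts are written 1 and 2.

```python
def rank(node):
    """Rank of an L_aggr term or formula encoded as a nested tuple."""
    kind = node[0]
    if kind in ("var", "const"):
        return 0
    if kind in ("atom", "fn"):
        # ("atom", R, [terms]) or ("fn", f, [terms]): maximum over the terms.
        return max((rank(t) for t in node[2]), default=0)
    if kind in ("and", "or"):
        return max(rank(node[1]), rank(node[2]))
    if kind == "not":
        return rank(node[1])
    if kind == "exists":
        # ("exists", (name, sort), body): only first-sort variables count.
        (_, sort), body = node[1], node[2]
        return rank(body) + (1 if sort == 1 else 0)
    if kind == "aggr":
        # ("aggr", F, [(name, sort), ...], phi, tau)
        _, _, ys, phi, tau = node
        first_sort = sum(1 for _, sort in ys if sort == 1)
        return max(rank(phi), rank(tau)) + first_sort
    raise ValueError(f"unknown node kind: {kind}")

# Aggr_SUM (y, w). (E(x, y, w), w) with y first-sort and w second-sort.
term = ("aggr", "SUM", [("y", 1), ("w", 2)],
        ("atom", "E", [("var", "x"), ("var", "y"), ("var", "w")]),
        ("var", "w"))
# exists x. P(term), with x first-sort.
formula = ("exists", ("x", 1), ("atom", "P", [term]))

print(rank(term), rank(formula))  # 1 2
```

The aggregate term contributes one to the rank because it binds a single first-sort variable; the second-sort variable `w` is free of charge, exactly as in the definition.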

Translating $\mathcal{L}_{aggr}$ into $\mathcal{L}_C$. This is the longest step in the proof, but although
it is somewhat tedious, conceptually it is quite straightforward.

Proposition 1. For every formula $\varphi(\bar{x})$ of $\mathcal{L}_{aggr}$(All, All), there exists an
equivalent formula $\varphi^\circ(\bar{x})$ of $\mathcal{L}_C$ such that $rk(\varphi^\circ) \le rk(\varphi)$.

Proof. We start by showing that one can define a formula $\exists i \bar{x}\, \varphi$ in $\mathcal{L}_C$, whose
meaning is that there exist at least $i$ tuples $\bar{x}$ such that $\varphi$ holds. Moreover, its
rank equals $rk(\varphi)$ plus the number of first-sort variables in $\bar{x}$. The proof is by
induction on the length of $\bar{x}$. If $\bar{x}$ is a single first-sort variable, then the counting
quantifier is already in $\mathcal{L}_C$. If $k$ is a second-sort variable, then $\exists i k\, \varphi(k, \cdot)$ is
equivalent to $\bigvee_C \bigwedge_{c \in C} \varphi(c, \cdot)$, where $C$ ranges over $i$-element subsets of Num -
this does not increase the rank. Suppose we can define it for $\bar{x}$ being of length
$n$. We now show how to define $\exists i (y, \bar{x}) \varphi$ for $y$ of the first sort, and $\exists i (k, \bar{x}) \varphi$ for
$k$ of the second sort.

1. Let $\psi(\bar{z}) = \exists i (y, \bar{x})\, \varphi(y, \bar{x}, \bar{z})$. It is the case that there are $i$ tuples $(b_j, \bar{a}_j)$
satisfying $\varphi(y, \bar{x}, \cdot)$ iff one can find an $l$-tuple of pairs $((n_1, m_1), \ldots, (n_l, m_l))$
with all $m_j$s distinct, such that

- there are at least $n_j$ tuples $\bar{a}$ for which the number of elements $b$ satisfying
$\varphi(b, \bar{a}, \cdot)$ is precisely $m_j$, and
- $\sum_{j=1}^{l} n_j \cdot m_j \ge i$.

Thus, $\psi(\bar{z})$ is equivalent to

$$\bigvee \bigwedge_{j=1}^{l} \exists n_j \bar{x}\, \bigl(\exists! m_j y\, \varphi(y, \bar{x}, \bar{z})\bigr)$$

where the disjunction is taken over all the tuples satisfying $n_j, m_j > 0$, $m_j$s
distinct, and $\sum_{j=1}^{l} n_j m_j \ge i$ (it is easy to see that a finite disjunction
would suffice), and $\exists! n u\, \varphi$ abbreviates $\exists n u\, \varphi \land \lnot \exists (n+1) u\, \varphi$.

The rank of this formula equals $rk(\exists! m_j y\, \varphi) = rk(\varphi) + 1$, plus the number
of first-sort variables in $\bar{x}$ (by the induction hypothesis) - that is, $rk(\varphi)$ plus
the number of first-sort variables in $(y, \bar{x})$.

2. Let $\psi(\bar{z}) = \exists i (k, \bar{x})\, \varphi(k, \bar{x}, \bar{z})$. The proof is identical to the proof above up to
the point of writing down the quantifier $\exists! m_j k\, \varphi(k, \cdot)$ - it is replaced by the
formula $\bigvee_C \bigl(\bigwedge_{c \in C} \varphi(c, \cdot) \land \bigwedge_{c \notin C} \lnot\varphi(c, \cdot)\bigr)$, where $C$ ranges over $m_j$-element
subsets of Num. As the rank of this equals $rk(\varphi)$, we conclude that the rank
of the formula equivalent to $\psi(\bar{z})$ equals $rk(\varphi)$ plus the number of first-sort
variables in $\bar{x}$.
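
The counting argument of case 1 can be checked on finite data. The Python sketch below compares a direct count of pairs $(b, a)$ with the decomposition into pairs $(n_j, m_j)$, choosing each $n_j$ as large as possible; the function names and the sample set of pairs are hypothetical.

```python
from collections import Counter

def at_least_pairs_direct(i, pairs):
    """Are there at least i distinct pairs (b, a) satisfying phi?"""
    return len(set(pairs)) >= i

def at_least_pairs_decomposed(i, pairs):
    """The same test through the (n_j, m_j) decomposition of the proof."""
    # m-value of each a: the number of elements b paired with it.
    witnesses_per_a = Counter(a for (_, a) in set(pairs))
    # n-value of each m: how many a have exactly m witnesses.
    histogram = Counter(witnesses_per_a.values())
    # The m_j are distinct by construction; each n_j is taken maximal.
    return sum(n * m for m, n in histogram.items()) >= i

# Hypothetical satisfying pairs (b, a).
pairs = {(1, "a"), (2, "a"), (1, "b"), (3, "c"), (4, "c")}
for i in range(8):
    assert at_least_pairs_direct(i, pairs) == at_least_pairs_decomposed(i, pairs)
print("decomposition agrees with direct counting")
```

Grouping the tuples $\bar{a}$ by their exact number of witnesses is what lets the formula count pairs with nested single-variable counting quantifiers.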

This concludes the proof that counting over tuples is definable in $\mathcal{L}_C$. With
this, we prove the proposition by induction on the formulae and terms. We also
produce, for each second-sort term $\tau(\bar{x})$ of $\mathcal{L}_{aggr}$, a formula $\psi_\tau(\bar{x}, z)$ of $\mathcal{L}_C$, with
$z$ of the second sort, such that $D \models \psi_\tau(\bar{a}, q)$ iff the value of $\tau(\bar{a})$ on $D$ is $q$.

We may assume, without loss of generality, that parameters of atomic $\mathcal{L}_{aggr}$
formulae $R(\cdot)$ and $P(\cdot)$ are tuples of variables: indeed, if a second-sort term
occurs in $R(\cdot \tau_i \cdot)$, it can be replaced by $\exists k\, (k = \tau_i) \land R(\cdot k \cdot)$ without increasing
the rank. We now define the translation as follows:

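- For a second-sort term $\tau$ which is a variable $q$, $\psi_\tau(q, z) \equiv (z = q)$. If $\tau$ is a
constant $c$, then $\psi_\tau(z) \equiv (z = c)$.

- For an atomic $\varphi$ of the form $x = y$, where $x, y$ are first-sort, $\varphi^\circ = \varphi$.

- For an atomic $\varphi$ of the form $P(\tau_1(\bar{x}), \ldots, \tau_n(\bar{x}))$, $\varphi^\circ(\bar{x})$ is
$\bigvee_{(c_1, \ldots, c_n) \in P} \bigwedge_{i=1}^{n} \psi_{\tau_i}(\bar{x}, c_i)$. Note that $rk(\varphi^\circ) = \max_i rk(\psi_{\tau_i}) \le \max_i rk(\tau_i) = rk(\varphi)$.

- $(\varphi_1 \lor \varphi_2)^\circ = \varphi_1^\circ \lor \varphi_2^\circ$, $(\varphi_1 \land \varphi_2)^\circ = \varphi_1^\circ \land \varphi_2^\circ$, $(\lnot\varphi)^\circ = \lnot\varphi^\circ$, $(\exists x \varphi)^\circ = \exists x \varphi^\circ$
for $x$ of either sort. Clearly, this does not increase the rank.

- For a term $\tau(\bar{x}) = f(\tau_1(\bar{x}), \ldots, \tau_n(\bar{x}))$, we have

$$\psi_\tau(\bar{x}, z) = \bigvee_{(c, c_1, \ldots, c_n)\,:\, c = f(\bar{c})} (z = c) \land \bigwedge_{j=1}^{n} \psi_{\tau_j}(\bar{x}, c_j)$$

Again it is easy to see that $rk(\psi_\tau) \le rk(\tau)$.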

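- For a term $\tau'(\bar{x}) = Aggr_F \bar{y}.\ (\varphi(\bar{x}, \bar{y}), \tau(\bar{x}, \bar{y}))$, $\psi_{\tau'}(\bar{x}, z)$ is defined as

$$[\varphi^\circ_\infty(\bar{x}) \land (z = f_\omega)] \lor [\lnot\varphi^\circ_\infty(\bar{x}) \land \psi'(\bar{x}, z)]$$

where $\varphi^\circ_\infty(\bar{x})$ tests if the number of $\bar{y}$ satisfying $\varphi(\bar{x}, \bar{y})$ is infinite, and $\psi'$
produces the value of the term in the case the number of such $\bar{y}$ is finite.
The formula $\varphi^\circ_\infty(\bar{x})$ can be defined as

$$\bigvee_{i\,:\, y_i \text{ of 2nd sort}} \ \bigvee_{C \subseteq Num,\ |C| = \infty} \ \bigwedge_{c \in C} \varphi^\circ_i(c, \bar{x})$$

where $\varphi^\circ_i(y_i, \bar{x}) = \exists (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_m)\, \varphi^\circ(\bar{x}, \bar{y})$.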

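The formula $\psi'(\bar{x}, z)$ is defined as the disjunction of $\lnot\exists \bar{y}\, \varphi^\circ(\bar{x}, \bar{y}) \land z = f_0$
and

$$\bigvee \Bigl( z = c \ \land\ \exists! n_1 \bar{y}\, (\varphi^\circ(\bar{x}, \bar{y}) \land \psi_\tau(\bar{x}, \bar{y}, c_1)) \ \land\ \ldots \ \land\ \exists! n_l \bar{y}\, (\varphi^\circ(\bar{x}, \bar{y}) \land \psi_\tau(\bar{x}, \bar{y}, c_l)) \ \land\ \forall \bar{y} \bigwedge_{a \in Num} \bigl(\varphi^\circ(\bar{x}, \bar{y}) \land \psi_\tau(\bar{x}, \bar{y}, a) \to \bigvee_{i=1}^{l} (a = c_i)\bigr) \Bigr)$$

where the disjunction is taken over all tuples $(c_1, n_1), \ldots, (c_l, n_l)$, $l > 0$, $n_i > 0$,
and values $c \in Num$ such that

$$F(\{|\, \underbrace{c_1, \ldots, c_1}_{n_1 \text{ times}}, \ldots, \underbrace{c_l, \ldots, c_l}_{n_l \text{ times}} \,|\}) = c.$$

Indeed, this formula asserts that either $\varphi(\bar{x}, \cdot)$ does not hold and then $z = f_0$,
or that $c_1, \ldots, c_l$ are exactly the values of the term $\tau(\bar{x}, \bar{y})$ when $\varphi(\bar{x}, \bar{y})$ holds,
and that $n_i$s are the multiplicities of the $c_i$s.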

A straightforward analysis of the produced formulae shows that $rk(\psi_{\tau'}) \le
\max(rk(\varphi^\circ), rk(\psi_\tau))$ plus the number of first-sort variables in $\bar{y}$; that is,
$rk(\psi_{\tau'}) \le rk(\tau')$. This completes the proof of the proposition.



$\mathcal{L}_C$ is local. Formulae of $\mathcal{L}_{aggr}$ have finite rank; hence they are translated into
$\mathcal{L}_C$ formulae of finite rank. We now show by a simple induction argument that
those formulae are local. More precisely, we show that for every finite-rank $\mathcal{L}_C$
formula $\varphi(\bar{x}, \bar{\imath})$ ($\bar{x}$ of first-sort, $\bar{\imath}$ of second-sort) over pure relational SC, there
exists a number $r \ge 0$ such that $\bar{a} \approx_r^D \bar{b}$ implies $D \models \varphi(\bar{a}, \bar{\imath}_0) \leftrightarrow \varphi(\bar{b}, \bar{\imath}_0)$ for any
$\bar{\imath}_0$. The smallest such $r$ will be denoted by $lr(\varphi)$. The proof is based on:

Lemma 1 (Permutation Lemma). Let $D$ be first-sort, with $A = adom(D)$,
and $r \ge 0$. If $\bar{a} \approx_{3r+1}^D \bar{b}$, then there exists a permutation $\rho : A \to A$ such that
$\bar{a}c \approx_r^D \bar{b}\rho(c)$ for every $c \in A$.

Proof. Fix an isomorphism $h : N_{3r+1}^D(\bar{a}) \to N_{3r+1}^D(\bar{b})$ with $h(\bar{a}) = \bar{b}$. For any $c \in
S_{2r+1}^D(\bar{a})$, $h(c) \in S_{2r+1}^D(\bar{b})$ has the same isomorphism type of its $r$-neighborhood.
Thus, for any isomorphism type $T$ of an $r$-neighborhood of a single element, there
are equally many elements in $A - S_{2r+1}^D(\bar{a})$ and in $A - S_{2r+1}^D(\bar{b})$ that realize $T$.
Thus, we have a bijection $g : A - S_{2r+1}^D(\bar{a}) \to A - S_{2r+1}^D(\bar{b})$ such that $c \approx_r^D g(c)$.
Then $\rho$ can be defined as $h$ on $S_{2r+1}^D(\bar{a})$, and as $g$ on $A - S_{2r+1}^D(\bar{a})$. □

Based on the lemma, we show that every $\mathcal{L}_C$ formula $\varphi$ of finite rank is
local, with $lr(\varphi) \le (3^{rk(\varphi)} - 1)/2$. Note that for the sequence $r_0 = 0, \ldots, r_{i+1} =
3r_i + 1, \ldots$, we have $r_k = (3^k - 1)/2$; we show $lr(\varphi) \le r_{rk(\varphi)}$.
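
The closed form for the sequence of radii follows by induction: $3 \cdot (3^k - 1)/2 + 1 = (3^{k+1} - 1)/2$. A short Python check, with the hypothetical helper name `locality_bound`, confirms it for small ranks:

```python
def locality_bound(k):
    """r_k from the recurrence r_0 = 0, r_{i+1} = 3 * r_i + 1."""
    r = 0
    for _ in range(k):
        r = 3 * r + 1
    return r

# The recurrence agrees with the closed form (3^k - 1) / 2.
for k in range(10):
    assert locality_bound(k) == (3 ** k - 1) // 2

print([locality_bound(k) for k in range(5)])  # [0, 1, 4, 13, 40]
```

The bound on the locality rank therefore grows exponentially with the rank of the formula, since each counting quantifier triples the radius and adds one.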

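The proof of this is by induction on the formulae, and it is absolutely
straightforward for all cases except counting quantifiers. For example, if $\varphi(\bar{x}, \bar{\imath}) = \bigvee_j \varphi_j$
and $m = rk(\varphi)$, then by the hypothesis, $lr(\varphi_j) \le r_m$, as $rk(\varphi_j) \le rk(\varphi)$. So
fix $\bar{\imath}_0$, and let $\bar{a} \approx_{r_m}^D \bar{b}$. Then $D \models \varphi_j(\bar{a}, \bar{\imath}_0) \leftrightarrow \varphi_j(\bar{b}, \bar{\imath}_0)$ for all $j$ by the induction
hypothesis, and thus $D \models \varphi(\bar{a}, \bar{\imath}_0) \leftrightarrow \varphi(\bar{b}, \bar{\imath}_0)$.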

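Now consider the case of the counting quantifier $\psi(\bar{x}, \bar{\imath}) = \exists i z\, \varphi(\bar{x}, z, \bar{\imath})$. Let
$rk(\varphi) = m$, then $rk(\psi) = m + 1$ and $r_{m+1} = 3r_m + 1$. Fix $\bar{\imath}_0$, and let $\bar{a} \approx_{3r_m+1}^D \bar{b}$.
By the Permutation Lemma, we get a permutation $\rho : A \to A$ such that $\bar{a}c \approx_{r_m}^D
\bar{b}\rho(c)$. By the hypothesis, $lr(\varphi) \le r_m$, and thus $D \models \varphi(\bar{a}, c, \bar{\imath}_0) \leftrightarrow \varphi(\bar{b}, \rho(c), \bar{\imath}_0)$.
Hence, the number of elements of $A$ satisfying $\varphi(\bar{a}, \cdot, \bar{\imath}_0)$ is exactly the same as the
number of elements satisfying $\varphi(\bar{b}, \cdot, \bar{\imath}_0)$, which implies $D \models \psi(\bar{a}, \bar{\imath}_0) \leftrightarrow \psi(\bar{b}, \bar{\imath}_0)$.
This concludes the proof of locality of $\mathcal{L}_C$.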

Putting everything together, let $e$ be a pure relational expression of
$ALG_{aggr}$(All, All). By Theorem 2, it is expressible in $\mathcal{L}_{aggr}$(All, All), and by Proposition 1,
by an $\mathcal{L}_C$ formula of finite rank. Hence, it is local.



7 SQL over Ordered Domains 

So far the only nonnumerical selection was of the form $\sigma_{i=j}$, testing equality of
two attributes. We now extend the language to $ALG^<_{aggr}$ by allowing selections
of the form $\sigma_{i<j}(e)$, where both $i$ and $j$ are of the type b, and $<$ is some fixed
linear ordering on the domain Dom.

This small addition changes the situation dramatically, and furthermore in
this case we can't make blanket statements like "queries are local" - a lot will
depend on the numerical domain Num and available arithmetic operations.

7.1 Natural Numbers 

Let $Num = \mathbb{N}$. We consider a version of $ALG_{aggr}$ that has the most usual set of
arithmetic and aggregate operators: namely, $+$, $\cdot$, $<$ and constants for arithmetic,
and the aggregate $\Sigma$. This suffices to express aggregates MIN, MAX, COUNT, TOTAL,
but certainly not AVG, which produces rational numbers.

We shall use the notations:

- $SQL_{\mathbb{N}}$ for $ALG_{aggr}(\{+, \cdot, <, 0, 1\}, \{\Sigma\})$, and

- $SQL^<_{\mathbb{N}}$ for $ALG^<_{aggr}(\{+, \cdot, <, 0, 1\}, \{\Sigma\})$.

It is sufficient to have constants just for 0 and 1, as all other numbers are definable
with $+$.

We show how a well-known counting logic FO(C) [3] can be embedded into
$SQL^<_{\mathbb{N}}$. The importance of this lies in the fact that FO(C) over ordered structures
captures a complexity class, called TC$^0$ [3,26], for which no nontrivial general
lower bounds are known. In fact, although TC$^0$ is contained in DLOGSPACE,
the containment is not known to be proper, and to this day we don't even know
if TC$^0 \ne$ NP. Moreover, there are indications that proving such a separation
result, at least by traditional methods, is either impossible, or would have some
very unexpected cryptographic consequences [27].

Definition of FO(C). (see [3,10,19]) It is a two-sorted logic, with second sort
being the sort of natural numbers. That is, a structure $D$ is of the form

$$\langle \{a_1, \ldots, a_n\}, \{1, \ldots, n\}, <, +, \cdot, 1, n, R_1, \ldots, R_l \rangle,$$

where the relations $R_i$ are defined on the domain $\{a_1, \ldots, a_n\}$, while on the
numerical domain $\{1, \ldots, n\}$ one has $1$, $n$, $<$ and $+$, $\cdot$ interpreted as ternary
predicates (e.g., $+(x, y, z)$ holds iff $x + y = z$). This logic extends first-order by
counting quantifiers $\exists i x\, \varphi(x)$, meaning that at least $i$ elements satisfy $\varphi$; here
$i$ refers to the numerical domain $\{1, \ldots, n\}$ and $x$ to the domain $\{a_1, \ldots, a_n\}$.
These quantifiers bind $x$ but not $i$.

Theorem 4. Over ordered structures, FO(C) $\subseteq SQL^<_{\mathbb{N}}$. In particular,

uniform TC$^0 \subseteq SQL^<_{\mathbb{N}}$.

Proof sketch. With order and aggregate TOTAL, one can define the set $\{1, \ldots, m\}$
where $m = |adom(D)|$ (by counting the number of elements not greater than
each element in the active domain). On this set, one defines $+$, $\cdot$, $<$, and then uses
the standard translation of calculus into algebra, except for using the aggregate
$\Sigma$ to translate counting quantifiers. □
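
The first step of the proof sketch, defining $\{1, \ldots, m\}$ by counting, can be mimicked in Python. This is only an illustration of the counting idea, not of the algebra expression itself; the names `ranks` and `count_quantifier` and the sample domain are hypothetical.

```python
def ranks(adom, less_than):
    """Map each element to the number of elements not greater than it."""
    # For a linear order this is the position of the element, from 1 to m.
    return {a: sum(1 for b in adom if not less_than(a, b)) for a in adom}

def count_quantifier(i, adom, phi):
    """(exists i x) phi(x): TOTAL over one 1 per witness, compared with i."""
    return sum(1 for x in adom if phi(x)) >= i

# Hypothetical active domain, ordered alphabetically.
adom = {"carol", "alice", "bob"}
position = ranks(adom, lambda x, y: x < y)

print(sorted(position.values()))  # [1, 2, 3], i.e. {1, ..., m} with m = 3
print(count_quantifier(2, adom, lambda x: x < "carol"))  # True
```

Once the numbers $1, \ldots, m$ are available as a relation, the counting quantifier reduces to summing a column of ones and comparing the total with $i$.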



Corollary 1. Assume that reachability is not expressible in $SQL^<_{\mathbb{N}}$. Then uniform
TC$^0$ is properly contained in NLOGSPACE.

As separation of complexity classes is currently beyond reach, so is proving
expressivity bounds for $SQL^<_{\mathbb{N}}$.

One can also show a closely-related upper bound on the class of decision
problems expressible in $SQL^<_{\mathbb{N}}$:

Proposition 2. Every Boolean query in $SQL^<_{\mathbb{N}}$ is contained in P-uniform TC$^0$.

Notice that the reachability query, even over ordered domains of nodes, is
order-independent, that is, the result does not depend on a particular ordering
on the nodes, just on the graph structure. Could it be that order-independent
queries in $SQL_{\mathbb{N}}$ and $SQL^<_{\mathbb{N}}$ are the same? Of course, such a result would imply
that TC$^0$ is properly contained in DLOGSPACE, and several papers suggested
this approach towards separating complexity classes. Unfortunately, it does not
work, as shown in [17]:

Proposition 3. There exist order-independent non-local queries expressible in
$SQL^<_{\mathbb{N}}$. Thus, there are order-independent $SQL^<_{\mathbb{N}}$ queries not expressible in $SQL_{\mathbb{N}}$.

Proof sketch. On the graph of an $n$-element successor relation with an extra
predicate $P$ interpreted as the first $\lfloor \log_2 n \rfloor$ elements, one can define the reachability
query restricted to the elements of $P$. □

Counting abilities of $SQL_{\mathbb{N}}$ are essential for this result, as its analog for
relational calculus does not hold [9].



7.2 Rational Numbers 

The language $SQL^<_{\mathbb{N}}$ falls short of the class of queries real SQL can define, as it
only uses natural numbers. To deal with rational arithmetic (and thus to permit
aggregates such as AVG), we extend the numerical domain Num to that of rational
numbers $\mathbb{Q}$, and introduce the language

$SQL^<_{\mathbb{Q}}$ as $ALG^<_{aggr}(\{+, -, \cdot, \div, <, 0, 1\}, \{\Sigma\})$.

This is a stronger language than $SQL^<_{\mathbb{N}}$ (and thus than FO(C)) - to see this,
note that it can define rational numbers, and if one represents those by pairs of
natural numbers, in some queries these numbers may grow exponentially with
the size of the database: something that cannot happen in the context of $SQL^<_{\mathbb{N}}$.

The most interesting feature of $SQL^<_{\mathbb{Q}}$ is perhaps that it is capable of coding
inputs with numbers:

Theorem 5. Let SC be a pure relational schema. Then there is an $SQL^<_{\mathbb{Q}}$
expression $e_{SC}$ of type n such that for every SC-database $D$, $e_{SC}(D)$ is a single
rational number, and

$$D_1 \ne D_2 \ \Rightarrow\ e_{SC}(D_1) \ne e_{SC}(D_2)$$

Proof sketch. The proof is based on the following: if $P_1$ and $P_2$ are two distinct
nonempty sets of prime numbers, then $\sum_{p \in P_1} 1/p \ne \sum_{p \in P_2} 1/p$. We then code
tuples with prime numbers (at most polynomial in the size of the input) and
add up inverses of those codes. □
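
The arithmetic fact behind the proof sketch is easy to test with exact rational arithmetic. The Python sketch below checks that distinct nonempty sets of small primes have distinct sums of reciprocals, and codes a directed graph by one rational number. The assignment of the prime with index `u * n + v` to the pair `(u, v)` is a hypothetical choice made for illustration; it is not claimed to be the coding used in the paper.

```python
from fractions import Fraction
from itertools import combinations

def reciprocal_code(primes):
    """Sum of 1/p over the given primes, as an exact rational."""
    return sum((Fraction(1, p) for p in primes), Fraction(0))

def first_primes(count):
    """The first `count` primes, by trial division against earlier primes."""
    found = []
    candidate = 2
    while len(found) < count:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def encode_graph(edges, n):
    """Code a directed graph on nodes 0..n-1 by a single rational number."""
    primes = first_primes(n * n)  # one prime per possible edge (assumption)
    return reciprocal_code(primes[u * n + v] for (u, v) in edges)

# Distinct nonempty subsets of primes give distinct codes.
small = [2, 3, 5, 7, 11, 13]
codes = set()
subsets = 0
for size in range(1, len(small) + 1):
    for subset in combinations(small, size):
        codes.add(reciprocal_code(subset))
        subsets += 1
assert len(codes) == subsets

g1 = {(0, 1), (1, 2)}
g2 = {(0, 1), (2, 1)}
print(encode_graph(g1, 3), encode_graph(g2, 3))  # 16/39 22/57
```

Distinctness follows because, after bringing a sum to the common denominator given by the product of its primes, each prime divides all numerator summands except its own, so the set of primes can be recovered from the reduced fraction.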

Thus, with the addition of some arithmetic operations, $SQL^<_{\mathbb{Q}}$ can express
many queries; in particular, $SQL^<_{\mathbb{Q}}$ extended with all computable numerical
functions expresses all computable queries over pure relational schemas! In fact, to
express all computable Boolean queries over such schemas, it suffices to add all
computable functions from $\mathbb{Q}$ to $\{0, 1\}$. In contrast, one can show that adding
all computable functions from $\mathbb{N}$ to $\{0, 1\}$ to $SQL^<_{\mathbb{N}}$ does not give us the same
power, as the resulting queries can be coded by non-uniform TC$^0$ circuits. Still,
the coding is just of theoretical interest; even for graphs with 20 nodes it can
produce codes of the form $p/q$ with $p, q$ relatively prime, and $q >$ ...; for
$q >$ ... one needs only 60 nodes.

8 Conclusion

Did SQL3 designers really have to introduce recursion, or is it expressible with 
what’s already there? Our results show that they clearly had a good reason for 
adding a new construct, because: 

1. Over unordered types, reachability queries cannot be expressed by the basic
SQL SELECT-FROM-WHERE-GROUPBY-HAVING statements; in fact, all queries
expressible by such statements are local.

2. Over ordered domains, with limited arithmetic, reachability queries are most
likely inexpressible, but proving this is as hard as separating some complexity
classes (and perhaps as hard as refuting some cryptographic assumptions).
Adding more arithmetic operations might help, but only at the expense of
encodings which are several thousand digits long - so the new construct is
clearly justified.

Being a theoretician, I like to see proofs of theorems (even folk theorems!), 
hence writing all those papers [23,21,24,18] on the expressiveness of SQL. Having 
finished [18] just over a year ago, I felt that the whole story can be presented in 
a nice and clean fashion, without asking the reader to spend days studying the 
prerequisites. I’ve attempted to give such a presentation here. I hope I convinced 
you that next-generation database theory texts shouldn’t just state that certain 
queries are inexpressible in SQL, they should also include simple proofs of these 
results.

Abstract. A number of efficient methods for evaluating first-order and monadic
second-order queries on finite relational structures are based on tree-decompositions
of structures or queries. We systematically study these methods. In the
first part of the paper we consider tree-like structures. We generalize a theorem of
Courcelle [7] by showing that on such structures a monadic second-order formula
(with free first-order and second-order variables) can be evaluated in time linear in
the structure size plus the size of the output. In the second part we study tree-like
formulas. We generalize the notions of acyclicity and bounded tree-width from
conjunctive queries to arbitrary first-order formulas in a straightforward way and
analyze the complexity of evaluating formulas of these fragments. Moreover, we
show that the acyclic and bounded tree-width fragments have the same expressive
power as the well-known guarded fragment and the finite-variable fragments of
first-order logic, respectively.



1 Introduction 

Evaluating first-order, or relational calculus, queries against a finite relational database 
is well-known to be PSPACE-complete [17]. The complexity we refer to here is called 
the combined complexity of the query language [22], i.e. the complexity of the evaluation 
problem measured both in terms of the length of the query and the size of the database. 
Many research efforts went into handling this high worst case complexity. In practice, 
various query optimization heuristics are used; they are based both on the structure of 
the queries and the (expected) structure of the databases. 

One of the important theoretical notions is that of acyclic conjunctive queries (cf. [1]).
Yannakakis [24] proved that acyclic conjunctive queries can be recognized and evaluated
efficiently. In the last few years, there has been renewed interest in acyclic conjunctive
queries and related notions based on the graph theoretic concept of tree-width [6,15,12,13].
Whereas these approaches concentrate on conjunctive queries, there have also been
attempts to isolate larger fragments of first-order logic whose combined complexity is
in PTIME. In particular, Vardi [23] observed that this is the case for the finite variable
fragments of first-order logic. A fragment of first-order logic that has recently received
a lot of attention is the guarded fragment [3]. Gottlob, Grädel, and Veith [11] proved
that its combined complexity is linear in both the length of the formula and the size of
the database.

There is a different way of using tree-width to evaluate queries that originated in the
area of graph algorithms. Here we do not restrict the class of queries, but the class of input
structures, or databases. This approach does not only work for first-order logic, but even
for the much stronger monadic second-order logic. Courcelle [7] proved that Boolean
monadic second-order queries on structures of bounded tree-width can be evaluated in
time linear in the size of the input structure, or more precisely in time $f(l) \cdot n$, where $l$
is the length of the query, $n$ is the size of the structure, and $f : \mathbb{N} \to \mathbb{N}$ is some
(fast-growing) function.* Arnborg, Lagergren, and Seese [4] extended Courcelle's result by
showing that on structures of bounded tree-width the number of satisfying assignments
of a monadic second-order formula with free variables can be computed in time linear
in the size of the input structure.

In the first part of this paper we shall further extend this approach by proving that
on structures of bounded tree-width the set of all satisfying assignments of a monadic
second-order formula (with free first and second-order variables) can be computed in time
linear in the size of the input structure and the size of the output. For example, all cliques
of a graph $\mathcal{G}$ of bounded tree-width can be computed in time $O(|G| + \sum_{C \text{ clique in } \mathcal{G}} |C|)$.
Similarly, all Hamiltonian cycles of a graph of bounded tree-width can be computed in
time linear in the size of the graph and the output.

In the second part of the paper we study fragments of first-order logic whose for- 
mulas have a tree-like structure. We start by reviewing Yannakakis’s [24] algorithm for 
evaluating acyclic conjunctive queries. Practically the same algorithm can be used to 
evaluate conjunctive queries of bounded tree-width [6,15], bounded query-width [6], or 
bounded hypertree-width [13]. We extend the notions of acyclicity and tree-width from 
conjunctive queries to full first-order logic. Our approach is based on the well-known 
correspondence between first-order formulas and non-recursive stratified datalog pro- 
grams (cf [1]). We generalize acyclicity and tree-width in a straightforward way to 
such programs. Yannakakis’s algorithm immediately gives us algorithms for evaluating 
queries defined by acyclic or bounded-tree-width programs. 

Then we show that the tree-like fragments of first-order logic obtained by this ap- 
proach are closely related to the two fragments mentioned earlier: Acyclic programs 
correspond to guarded first-order formulas and bounded tree-width programs corre- 
spond to finite-variable first-order formulas. The latter extends an observation Kolaitis 
and Vardi [15] made on the level of conjunctive queries. In particular the result on the 
guarded fragment, though not difficult to prove, is quite remarkable, since the motivation 
for introducing the guarded fragment was completely different. Nevertheless, it turns out 
that this new fragment can be described in terms of the well-known concept of acyclicity. 
The second part of our paper thus shows that Yannakakis’s method of evaluating tree-like 
queries is very far reaching and comprises all the methods used to evaluate queries in 
the other fragments of first-order logic mentioned here. 



* Another way to phrase Courcelle's result is to say that the data complexity [22] of monadic
second-order logic on graphs of bounded tree-width is in linear time. However, in this paper
we consistently maintain the view of combined complexity.

In the last section we discuss the connections between the two parts of the paper.
The algorithms in both parts are very similar, and we give an explanation of why that is
using tree-automata.

It is one of our main objectives to analyze the running times of the algorithms as
precisely as possible, instead of just saying that we have polynomial time algorithms.
Occasionally, this requires considerable additional effort. Furthermore, we are always
interested in evaluating formulas with free variables and not just sentences. Let us also
emphasize that we are not fixing a vocabulary for our formulas and structures in advance,
but let the vocabulary vary with the inputs.

Due to space limitations, we can at most sketch the proofs of our results in this 
extended abstract. For details and additional background information, we refer the reader 
to the full version of this paper [10]. 

2 Preliminaries 

Structures and Queries. A vocabulary $\tau$ is a finite set of relation symbols. The arity
of a vocabulary is the maximum of the arities of the relation symbols it contains. A
$\tau$-structure $\mathcal{A}$ consists of a non-empty set $A$, called the universe of $\mathcal{A}$, and a relation
$R^{\mathcal{A}} \subseteq A^r$ for each $r$-ary relation symbol $R \in \tau$. If $\mathcal{A}$ is a structure and $B \subseteq A$
non-empty, then $\langle B \rangle^{\mathcal{A}}$ denotes the substructure induced by $\mathcal{A}$ on $B$. We only consider finite
structures. It is convenient to assume that elements of a structure are natural numbers.
In other words, the universe of every structure considered here is a finite subset of $\mathbb{N}$.

STR denotes the class of all structures. If $\mathcal{C}$ is a class of structures, $\mathcal{C}[\tau]$ denotes the
subclass of all $\tau$-structures in $\mathcal{C}$. We consider graphs as $\{E\}$-structures $\mathcal{G} = (G, E^{\mathcal{G}})$,
where $E^{\mathcal{G}}$ is an anti-reflexive and symmetric binary relation. A colored graph is a
structure $\mathcal{B} = (B, E^{\mathcal{B}}, P_1^{\mathcal{B}}, \ldots, P_n^{\mathcal{B}})$, where $(B, E^{\mathcal{B}})$ is a graph and the unary relations
$P_1^{\mathcal{B}}, \ldots, P_n^{\mathcal{B}}$ form a partition of the universe $B$.

A $k$-ary query of vocabulary $\tau$ is a mapping $\chi$ that associates with each structure
$\mathcal{A} \in$ STR$[\tau]$ a $k$-ary relation $\chi(\mathcal{A}) \subseteq A^k$ such that for every isomorphism $f : \mathcal{A} \to \mathcal{B}$
between structures $\mathcal{A}, \mathcal{B} \in$ STR$[\tau]$ we have $\chi(\mathcal{B}) = f(\chi(\mathcal{A}))$. We admit $k = 0$ and let
$A^0$ consist of one element (the empty tuple) for every $A$. We identify $A^0$ with True and
$\emptyset$ with False. 0-ary queries are usually called Boolean queries.



Logics. FO and MSO denote the classes of formulas of first-order logic and of monadic
second-order logic, respectively. If L is a class of formulas, then L$[\tau]$ denotes the
class of all formulas of vocabulary $\tau$ in L. For a structure $\mathcal{A} \in$ STR and a formula
$\varphi(X_1, \ldots, X_l, x_1, \ldots, x_m)$ with free monadic second-order variables $X_1, \ldots, X_l$ and
free first-order variables $x_1, \ldots, x_m$ we let

$$\varphi(\mathcal{A}) := \{(A_1, \ldots, A_l, a_1, \ldots, a_m) \mid \mathcal{A} \models \varphi(A_1, \ldots, A_l, a_1, \ldots, a_m)\}.$$

For sentences we have $\varphi(\mathcal{A}) =$ True if, and only if, $\mathcal{A}$ satisfies $\varphi$. If $\varphi$ has no free
second-order variables, then we call the mapping $\mathcal{A} \mapsto \varphi(\mathcal{A})$ the query defined by $\varphi$.

In this paper, for various logics L and classes $\mathcal{C}$ of structures we will study the
following evaluation problem for L on $\mathcal{C}$:

Input: Structure $\mathcal{A} \in \mathcal{C}$, formula $\varphi \in$ L.
Problem: Compute $\varphi(\mathcal{A})$.

For the class $\mathcal{C}$ of all structures we call this problem the evaluation problem for L.

We often denote tuples $(a_1, \ldots, a_k)$ of elements of a set $A$ by $\bar{a}$, and we write $\bar{a} \in A$
instead of $\bar{a} \in A^k$. Similar notations are used for tuples of subsets, tuples of variables,
etc.



Coding Issues. Our underlying model of computation is the standard RAM-model with
addition and subtraction as arithmetic operations (cf. [2,21]). We use the uniform cost
measure. We will carefully distinguish between the size $\|o\|$ of an object $o$, which is the
length of a natural encoding of $o$, and, if $o$ is a set, its cardinality, denoted by $|o|$. For
example, if $R$ is an $r$-ary relation on a set $A$, then for a reasonable encoding of $R$ we
have $\|R\| = O(r \cdot |R| + 1)$. The size $\|\mathcal{A}\|$ of a $\tau$-structure $\mathcal{A}$ is $O(|A| + \sum_{R \in \tau} \|R^{\mathcal{A}}\|)$.



Tree-Decompositions. A tree is a connected acyclic graph $\mathcal{T}$. We always fix an
(arbitrary) root $r^{\mathcal{T}} \in T$ in a tree $\mathcal{T}$. Then we have a natural partial order $\le^{\mathcal{T}}$ on $T$, which
is defined by: ($t \le^{\mathcal{T}} u \iff t$ appears on the path from $r^{\mathcal{T}}$ to $u$). We say that $u$ is a
child of $t$ or $t$ is the parent of $u$ if $(t, u) \in E^{\mathcal{T}}$ and $t \le^{\mathcal{T}} u$. For every $t \in T$ we let
$\mathcal{T}_t := \langle \{u \mid t \le^{\mathcal{T}} u\} \rangle^{\mathcal{T}}$ be the subtree rooted at $t$. A tree $\mathcal{T}$ is binary if every node has
either 0 or 2 children.

It will be convenient to work with hypergraphs (as an abstraction of relational
structures). A hypergraph $\mathcal{H}$ is a pair $(H, E^{\mathcal{H}})$ consisting of a non-empty set $H$ of vertices
and a set $E^{\mathcal{H}}$ of non-empty subsets of $H$ called hyperedges.

A tree-decomposition of a hypergraph $\mathcal{H}$ is a pair $(\mathcal{T}, (H_t)_{t \in T})$, where $\mathcal{T}$ is a tree
and $(H_t)_{t \in T}$ a family of subsets of $H$ (called the blocks of the decomposition) such that

(1) For every $v \in H$, the set $\{t \in T \mid v \in H_t\}$ is non-empty and connected (i.e. a
subtree).

(2) For every $e \in E^{\mathcal{H}}$ there is a $t \in T$ such that $e \subseteq H_t$.

The width of a tree-decomposition (T, {Hfjt^T) is max{|iTi| | t G T} — 1. The tree- 
width tw("H) of T~L is the minimum width over all possible tree-decompositions of T~L. 
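
The two conditions and the width can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding that is not part of the paper: `blocks` maps tree nodes to vertex sets, and `tree_edges` lists the edges of $T$ (the code trusts that they really form a tree).

```python
from collections import deque

def is_tree_decomposition(vertices, hyperedges, tree_edges, blocks):
    """Check conditions (1) and (2); assumes tree_edges forms a tree on blocks' keys."""
    adj = {t: set() for t in blocks}
    for t, u in tree_edges:
        adj[t].add(u)
        adj[u].add(t)
    # Condition (1): the nodes whose block contains v are non-empty and connected.
    for v in vertices:
        nodes = {t for t, block in blocks.items() if v in block}
        if not nodes:
            return False
        start = next(iter(nodes))
        seen, queue = {start}, deque([start])
        while queue:
            t = queue.popleft()
            for u in adj[t]:
                if u in nodes and u not in seen:
                    seen.add(u)
                    queue.append(u)
        if seen != nodes:
            return False
    # Condition (2): every hyperedge is contained in some block.
    return all(any(set(e).issubset(blocks[t]) for t in blocks) for e in hyperedges)

def width(blocks):
    """Width = size of the largest block minus one."""
    return max(len(block) for block in blocks.values()) - 1

# Illustrative input: the 4-cycle 1-2-3-4-1 with a two-node decomposition.
vertices = {1, 2, 3, 4}
hyperedges = [{1, 2}, {2, 3}, {3, 4}, {4, 1}]
tree_edges = [("t", "u")]
blocks = {"t": {1, 2, 3}, "u": {1, 3, 4}}
```

On this input the check succeeds and the width is 2, which matches the tree-width of a cycle.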

A tree-decomposition of a $\tau$-structure $\mathcal{A}$ is a tree-decomposition of the hypergraph

$(A, \{\{a_1, \ldots, a_r\} \mid \exists R \in \tau,\ R\ r\text{-ary},\ (a_1, \ldots, a_r) \in R^{\mathcal{A}}\})$;

the tree-width of $\mathcal{A}$ is defined accordingly.

A tree-decomposition $(T, (H_t)_{t \in T})$ of a hypergraph $\mathcal{H}$ is reduced if for all $t, u \in T$,
$t \ne u$, we have $H_t \not\subseteq H_u$. For a reduced tree-decomposition $(T, (H_t)_{t \in T})$ of $\mathcal{H}$ we
have $|T| \le |H|$.

Theorem 2.1 (Bodlaender [5]). There are a polynomial $p(X)$ and an algorithm that,
given a hypergraph $\mathcal{H}$, computes a reduced tree-decomposition of $\mathcal{H}$ of width
$w := \mathrm{tw}(\mathcal{H})$ in time $2^{p(w)} \cdot |H|$.

3 Tree-Like Structures

In this section we present algorithms for evaluating monadic second-order formulas.
We first deal with trees and then show how to extend the results to arbitrary structures
parameterized by tree-width. Our automata-theoretic approach is based on ideas of
Arnborg, Lagergren, and Seese [4].

For a finite alphabet $\Gamma$ we let $\tau_\Gamma$ be the vocabulary consisting of a binary relation
symbol $E$ and a unary relation symbol $P_\gamma$ for all $\gamma \in \Gamma$. A $\Gamma$-tree is a colored graph of
vocabulary $\tau_\Gamma$ whose underlying graph is a binary tree, and a colored tree is a $\Gamma$-tree
for some $\Gamma$.

We say that a vertex $t$ of a colored tree $T$ has color $\gamma$ (and write $\gamma(t) := \gamma$), if
$t \in P_\gamma^T$.

A (bottom-up) $\Gamma$-tree automaton is a tuple $\mathfrak{A} = (Q, \delta, \Delta, F)$, where $Q$ is a finite set,
the set of states, $\Delta : \Gamma \to Q$ is the starting function, $F \subseteq Q$ is the set of accepting states
and $\delta : [Q]^{\le 2} \times \Gamma \to Q$ is the transition function (hence we only consider deterministic
automata). Here, $[Q]^{\le 2} := \{\{q, q'\} \mid q, q' \in Q\}$ is the set of singletons and pairs of
elements of $Q$. The run $\rho : T \to Q$ of $\mathfrak{A}$ on a $\Gamma$-tree $T$ is defined in a bottom-up manner
(i.e., from leaves to the root): If $t$ is a leaf, then $\rho(t) := \Delta(\gamma(t))$; if $t$ has children $s_1, s_2$,
then $\rho(t) := \delta(\{\rho(s_1), \rho(s_2)\}, \gamma(t))$. The automaton $\mathfrak{A}$ accepts $T$ if $\rho(r^T) \in F$. A class
of colored trees is recognizable, if it is the class of colored trees accepted by some tree
automaton.
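
The definition of a run translates directly into a bottom-up traversal. The sketch below assumes a hypothetical dictionary encoding: `children` maps inner nodes to their two children, `start` plays the role of $\Delta$, and `delta` is keyed by a `frozenset` of child states together with the color. The example automaton (parity of the number of `"a"`-colored vertices) is illustrative only.

```python
def run(children, color, root, start, delta):
    """Return the run as a dict node -> state, computed from the leaves to the root."""
    rho = {}
    stack = [(root, False)]  # iterative post-order, so deep trees are no problem
    while stack:
        t, expanded = stack.pop()
        kids = children.get(t, ())
        if not kids:
            rho[t] = start[color[t]]
        elif expanded:
            s1, s2 = kids
            rho[t] = delta[(frozenset({rho[s1], rho[s2]}), color[t])]
        else:
            stack.append((t, True))
            stack.extend((s, False) for s in kids)
    return rho

def accepts(children, color, root, start, delta, accepting):
    return run(children, color, root, start, delta)[root] in accepting

# Illustrative automaton: the state is the parity of the number of "a"-vertices.
start = {"a": 1, "b": 0}
delta = {}
for states in (frozenset({0}), frozenset({1}), frozenset({0, 1})):
    below = len(states) - 1  # parity of both subtrees together: {q} -> 0, {0, 1} -> 1
    for c in ("a", "b"):
        delta[(states, c)] = (below + (c == "a")) % 2
accepting = {0}  # accept trees with an even number of "a"-vertices

children = {"r": ("x", "y")}
color = {"r": "b", "x": "a", "y": "a"}
```

Because the transition function takes an unordered set of child states, only properties that are symmetric in the two children can be expressed this way; parity is one of them.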

Theorem 3.1 (Thatcher and Wright [19]). Let $\Gamma$ be a finite alphabet. A class of $\Gamma$-trees
is recognizable if and only if it is definable by an MSO[$\tau_\Gamma$]-sentence. Furthermore,
there is an algorithm that computes the tree automaton corresponding to a given
MSO-sentence.

Theorem 3.2. There exist a function $f : \mathbb{N} \to \mathbb{N}$ and an algorithm that solves the
evaluation problem for MSO-formulas on colored trees in time $f(\|\varphi\|) \cdot (|T| + \|\varphi(T)\|)$.

Proof: We can restrict our attention to MSO-formulas without free first-order variables.

Let $\Gamma$ be an alphabet and $\varphi(X_1, \ldots, X_k) \in$ MSO[$\tau_\Gamma$]. Theorem 3.1 only applies to
sentences. Therefore, we replace the variables $X_1, \ldots, X_k$ by new unary relation symbols
appropriately. We set $\Gamma' := \Gamma \times \{0, 1\}^k$. A $\Gamma$-tree $T$ together with $B_1, \ldots, B_k \subseteq T$
leads to a $\Gamma'$-tree $T' := (T; B_1, \ldots, B_k)$ in a natural way: Denoting the color of $t \in T$
in $T'$ by $\gamma'(t)$, we let

$\gamma'(t) = (\gamma, \bar\varepsilon) \iff \gamma(t) = \gamma$ and $(t \in B_i \iff \varepsilon_i = 1)$ for $i = 1, \ldots, k$.

We call $\bar\varepsilon$ the additional color of $t$.

The class of $\Gamma'$-trees $\{(T; \bar B) \mid T$ colored $\Gamma$-tree, $\bar B \subseteq T$, $\bar B \in \varphi(T)\}$ is definable
in MSO; hence, there is a $\Gamma'$-automaton $\mathfrak{A} = (Q, \delta, \Delta, F)$ recognizing it. Let $T$ be a
$\Gamma$-tree. We describe how to compute the set $\varphi(T) = \{\bar B \subseteq T \mid \mathfrak{A}$ accepts $(T; \bar B)\}$. To
do so, we pass through the tree $T$ three times:

(1) Bottom-up. By induction from the leaves to the root, we first compute, for every
$t \in T$, a set $P_t$ of "potential states" at $t$: If $t$ is a leaf, then
$P_t := \{\Delta((\gamma(t), \bar\varepsilon)) \mid \bar\varepsilon \in \{0, 1\}^k\}$. For an inner vertex $t$ with children $s_1$ and $s_2$, we set

$P_t := \{\delta(\{q_1, q_2\}, (\gamma(t), \bar\varepsilon)) \mid q_1 \in P_{s_1}, q_2 \in P_{s_2}, \bar\varepsilon \in \{0, 1\}^k\}.$

Then for all $t \in T$ and $q \in Q$ we have: $q \in P_t$ if, and only if, there are sets $B_1, \ldots, B_k \subseteq T$
such that for the run $\rho$ of $\mathfrak{A}$ on $(T; \bar B)$ we have $\rho(t) = q$. If $P_r \cap F = \emptyset$ we have
$\varphi(T) = \emptyset$, and no further action is required.

(2) Top-down. Starting at the root $r := r^T$ we compute, for every $t \in T$, the subset
$S_t$ of $P_t$ of "success states" at $t$: We let $S_r := F \cap P_r$. If $t$ has parent $s$ and sibling $t'$,
then

$S_t := \{q \in P_t \mid$ there are $q' \in P_{t'}$, $\bar\varepsilon \in \{0, 1\}^k$ such that $\delta(\{q, q'\}, (\gamma(s), \bar\varepsilon)) \in S_s\}.$

Then for all $t \in T$ and $q \in Q$ we have: $q \in S_t$ if, and only if, there are sets $B_1, \ldots, B_k \subseteq T$
such that $\mathfrak{A}$ accepts $(T; \bar B)$ and $\rho(t) = q$ for the run $\rho$ of $\mathfrak{A}$ on $(T; \bar B)$.

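(3) Bottom-up again. Recall that for $t \in T$, by $T_t$ we denote the subtree of $T$ rooted
in $t$. For $t \in T$ and $q \in S_t$ we let

$\mathrm{Sat}_{t,q} := \{\bar B \subseteq T_t \mid$ for $1 \le i \le k$ there is $B'_i \subseteq T$ such that $B'_i \cap T_t = B_i$,
$\mathfrak{A}$ accepts $(T; \bar B')$, and for the run $\rho$ of $\mathfrak{A}$ on $(T; \bar B')$: $\rho(t) = q\}.$

We compute the sets $\mathrm{Sat}_{t,q}$ inductively from the leaves to the root. Let $t \in T$ and $q \in S_t$.
Set $B^1 := \{t\}$ and $B^0 := \emptyset$. If $t$ is a leaf, then
$\mathrm{Sat}_{t,q} = \{(B^{\varepsilon_1}, \ldots, B^{\varepsilon_k}) \mid \bar\varepsilon \in \{0, 1\}^k,\ \Delta((\gamma(t), \bar\varepsilon)) = q\}$.
If $t$ has children $s$ and $s'$, then

$\mathrm{Sat}_{t,q} = \{(B_1 \cup B'_1 \cup B^{\varepsilon_1}, \ldots, B_k \cup B'_k \cup B^{\varepsilon_k}) \mid \bar\varepsilon \in \{0, 1\}^k$, there exist
$q' \in S_s$, $q'' \in S_{s'}$ such that $\bar B \in \mathrm{Sat}_{s,q'}$, $\bar B' \in \mathrm{Sat}_{s',q''}$, $\delta(\{q', q''\}, (\gamma(t), \bar\varepsilon)) = q\}.$

Note that $\varphi(T) = \bigcup_{q \in S_r} \mathrm{Sat}_{r,q}$. Now we describe an algorithm evaluating a formula on
a tree: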



Input: Colored tree $T$, MSO-formula $\varphi(X_1, \ldots, X_k)$.

1. Check if there is an alphabet $\Gamma$ such that $T$ is a colored $\Gamma$-tree and $\varphi$ is an
MSO[$\tau_\Gamma$]-formula; if this is not the case then return $\emptyset$.

2. Compute the $\Gamma'$-tree automaton $\mathfrak{A}$ corresponding to $\varphi$.

3. For all $t \in T$, compute $P_t$.

4. For all $t \in T$, compute $S_t$.

5. For all $t \in T$ and $q \in S_t$, compute $\mathrm{Sat}_{t,q}$.

6. Return $\bigcup_{q \in S_r} \mathrm{Sat}_{r,q}$.

This algorithm can be implemented to work within the desired time bounds. The most 
difficult step is 5; it is also the only step where we need the automaton to be deterministic. 

□ 
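
Steps 3 to 6 can be sketched as follows, assuming the automaton for $\varphi$ is already given. The dictionary encoding (`children`, `start` for $\Delta$, `delta` keyed by a `frozenset` of states and a pair of color and bit tuple) and the example automaton are hypothetical. The sketch is deliberately naive and makes no attempt to meet the time bound of Theorem 3.2.

```python
from itertools import product

def evaluate(children, color, root, k, start, delta, accepting):
    """Return phi(T) as a set of k-tuples of frozensets (naive three-pass sketch)."""
    eps_all = list(product((0, 1), repeat=k))
    order, stack = [], [root]            # order lists every parent before its children
    while stack:
        t = stack.pop()
        order.append(t)
        stack.extend(children.get(t, ()))
    parent = {s: t for t in order for s in children.get(t, ())}

    def step(q1, q2, t, eps):            # one transition at the inner node t
        return delta[(frozenset({q1, q2}), (color[t], eps))]

    # Pass 1 (bottom-up): potential states P_t.
    P = {}
    for t in reversed(order):
        kids = children.get(t, ())
        if not kids:
            P[t] = {start[(color[t], e)] for e in eps_all}
        else:
            P[t] = {step(q1, q2, t, e)
                    for q1 in P[kids[0]] for q2 in P[kids[1]] for e in eps_all}
    if not P[root] & accepting:
        return set()

    # Pass 2 (top-down): success states S_t.
    S = {root: P[root] & accepting}
    for t in order[1:]:
        s = parent[t]
        sibling = next(u for u in children[s] if u != t)
        S[t] = {q for q in P[t]
                if any(step(q, q2, s, e) in S[s] for q2 in P[sibling] for e in eps_all)}

    # Pass 3 (bottom-up): Sat[t][q] for every q in S_t.
    def own(t, eps):                     # (B^{eps_1}, ..., B^{eps_k}) with B^1 = {t}
        return tuple(frozenset({t}) if bit else frozenset() for bit in eps)

    Sat = {}
    for t in reversed(order):
        Sat[t] = {q: set() for q in S[t]}
        kids = children.get(t, ())
        for e in eps_all:
            if not kids:
                q = start[(color[t], e)]
                if q in S[t]:
                    Sat[t][q].add(own(t, e))
                continue
            for q1 in S[kids[0]]:
                for q2 in S[kids[1]]:
                    q = step(q1, q2, t, e)
                    if q in S[t]:
                        Sat[t][q].update(
                            tuple(a | b | c for a, b, c in zip(own(t, e), B1, B2))
                            for B1 in Sat[kids[0]][q1] for B2 in Sat[kids[1]][q2])
    return set().union(*Sat[root].values())

# Hypothetical example: one color "c", k = 1, automaton for "X has exactly one element".
children = {"r": ("x", "y")}
color = {"r": "c", "x": "c", "y": "c"}
start = {("c", (e,)): e for e in (0, 1)}
delta = {(frozenset({q1, q2}), ("c", (e,))): min(2, q1 + q2 + e)
         for q1 in range(3) for q2 in range(3) for e in (0, 1)}
accepting = {1}
result = evaluate(children, color, "r", 1, start, delta, accepting)
```

On the three-node tree the result consists of the three singleton sets. The second pass is what keeps the third pass from assembling partial assignments that cannot be extended to an accepted coloring.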



The following corollary easily follows from the proof of Theorem 3.2:

Corollary 3.3. There exist a function $f : \mathbb{N} \to \mathbb{N}$ and an algorithm that, given a colored
tree $T$ and an MSO-formula $\varphi(X_1, \ldots, X_l, x_1, \ldots, x_m)$, decides in time $f(\|\varphi\|) \cdot |T|$
if there are sets $B_1, \ldots, B_l \subseteq T$ and elements $a_1, \ldots, a_m \in T$ such that
$T \models \varphi(B_1, \ldots, B_l, a_1, \ldots, a_m)$, and, if this is the case, computes such sets and elements.

For a given structure $\mathcal{A}$, by using a tree-decomposition $(T, (H_t)_{t \in T})$ of $\mathcal{A}$ with
underlying binary tree $T$ and by encoding the isomorphism type of the substructure induced
by each block $H_t$ in a coloring of $T$ appropriately, we obtain from Theorem 3.2:

Theorem 3.4. There exist a function $f : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ and an algorithm that solves the
evaluation problem for MSO-formulas in time $f(\|\varphi\|, \mathrm{tw}(\mathcal{A})) \cdot (|A| + \|\varphi(\mathcal{A})\|)$.

Remark 3.5. Courcelle and Mosbah [8] get an algorithm for evaluating MSO-formulas 
on graphs of bounded tree-width out of a more algebraic framework. However, they did 
not analyze the complexity of their algorithm, and it does not seem to be linear in the 
size of the output (as ours is). 

Corollary 3.6. There exist a function $f : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ and an algorithm that, given
a structure $\mathcal{A}$ and an MSO-formula $\varphi(X_1, \ldots, X_l, x_1, \ldots, x_m)$, decides in time
$f(\|\varphi\|, \mathrm{tw}(\mathcal{A})) \cdot |A|$, if there are sets $B_1, \ldots, B_l \subseteq A$ and elements $a_1, \ldots, a_m \in A$
such that $\mathcal{A} \models \varphi(B_1, \ldots, B_l, a_1, \ldots, a_m)$, and, if this is the case, computes such sets
and elements.

Remark 3.7. There is a trick that can sometimes improve the running time of our algorithms
considerably, in particular if the arity of the vocabulary, say $\tau$, is high:

We let $\tau_b$ be the vocabulary that contains a unary relation symbol $P_R$ for $R \in \tau$ and
binary relation symbols $E_1, \ldots, E_s$, where $s$ is the arity of $\tau$. Then with every $\tau$-structure
$\mathcal{A}$ we associate a $\tau_b$-structure $\mathcal{A}_b$, the bipartite structure associated with $\mathcal{A}$. The universe
$A_b$ consists of $A$ together with a new vertex $b_{R\bar a}$ for all $R \in \tau$ and $\bar a \in R^{\mathcal{A}}$. The relation $E_i$
holds for all pairs $(a_i, b_{R a_1 \ldots a_r})$, and $P_R^{\mathcal{A}_b} := \{b_{R\bar a} \mid \bar a \in R^{\mathcal{A}}\}$. $\mathcal{A}_b$ can be computed
from $\mathcal{A}$ in linear time, and we have $|A_b| = O(\|\mathcal{A}\|)$. Moreover, $\mathrm{tw}(\mathcal{A}_b) \le \mathrm{tw}(\mathcal{A}) + 1$.
If the arity of $\tau$ is at most $\mathrm{tw}(\mathcal{A})$, this can be improved to $\mathrm{tw}(\mathcal{A}_b) \le \mathrm{tw}(\mathcal{A})$. The tree-width
of $\mathcal{A}_b$ can be considerably smaller than that of $\mathcal{A}$. For example, if $R$ is a 1000-ary
relation, then the tree-width of $\mathcal{A} = (\{1, \ldots, 1000\}, \{(1, \ldots, 1000)\})$ is 999, whereas
the tree-width of $\mathcal{A}_b$ is 1. For graphs $\mathcal{G}$ we have $\mathrm{tw}(\mathcal{G}) = \mathrm{tw}(\mathcal{G}_b)$.
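
The construction of the bipartite structure is short enough to sketch. The encoding is an assumption made for illustration: relations are a dictionary from relation names to sets of tuples, and the new vertex $b_{R\bar a}$ is represented by the pair `(name, tuple)`, which presupposes that no such pair already occurs in the universe.

```python
def bipartite_structure(universe, relations):
    """Return (universe_b, unary, binary) for the bipartite structure.

    unary[R] is the interpretation of P_R, binary[i] that of E_i (1-based).
    """
    universe_b = set(universe)
    unary = {name: set() for name in relations}
    binary = {}
    for name, tuples in relations.items():
        for tup in tuples:
            b = (name, tup)  # stands for the new vertex b_{R a-bar}
            universe_b.add(b)
            unary[name].add(b)
            for i, a in enumerate(tup, start=1):
                binary.setdefault(i, set()).add((a, b))
    return universe_b, unary, binary

# The example from the text: one 1000-ary relation containing a single tuple.
universe = set(range(1, 1001))
relations = {"R": {tuple(range(1, 1001))}}
universe_b, unary, binary = bipartite_structure(universe, relations)
```

Here the result has 1001 vertices and each $E_i$ contains exactly one pair, all sharing the same new vertex, so the underlying graph is a star, in line with the tree-width 1 stated above.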

It is easy to see that there is a linear-time algorithm that associates with every MSO-formula
$\varphi$ an MSO-formula $\varphi_b$ such that for all structures $\mathcal{A}$ we have $\varphi(\mathcal{A}) = \varphi_b(\mathcal{A}_b)$.
Thus to evaluate MSO-formulas we can also proceed as follows: Given a formula $\varphi$ and a
structure $\mathcal{A}$, we first compute $\varphi_b$ and $\mathcal{A}_b$. Then we compute $\varphi_b(\mathcal{A}_b)$ using our algorithms.
By Theorem 3.4, this requires time $O(f(\|\varphi_b\|, \mathrm{tw}(\mathcal{A}_b)) \cdot (\|\mathcal{A}\| + \|\varphi(\mathcal{A})\|))$, which
can be much better than $f(\|\varphi\|, \mathrm{tw}(\mathcal{A})) \cdot (|A| + \|\varphi(\mathcal{A})\|)$.

There is another advantage in working with the structure $\mathcal{A}_b$ instead of $\mathcal{A}$: MSO
becomes more expressive. Intuitively, the reason is that in $\mathcal{A}_b$ we can talk about sets of
"edges". For example, it is easy to see that there is an MSO-formula $\varphi(X)$ such that
for every graph $\mathcal{G}$, $\varphi(\mathcal{G}_b)$ consists of all sets $\{b_{Eab} \mid ab \in E'\}$, where $E'$ ranges over
the edge sets of all Hamiltonian cycles of $\mathcal{G}$. Thus in a sense $\varphi$ defines the set of all
Hamiltonian cycles of a graph. This is not possible by an MSO-formula in the original
$\mathcal{G}$. As a matter of fact, there is not even an MSO-sentence that holds in a graph $\mathcal{G}$ if, and
only if, $\mathcal{G}$ is Hamiltonian [16].

4 Tree-Like Formulas

In this section, we restrict our attention to first-order formulas. Recall that atomic formulas,
or atoms, are formulas of the form $x = y$ or $R x_1 \ldots x_r$ for an $r$-ary relation
symbol $R$. The set of all atoms occurring in a formula $\varphi$ is denoted by $\mathrm{at}(\varphi)$. Literals
are atomic or negated atomic formulas. The set of all variables occurring in a formula $\varphi$
is denoted by $\mathrm{var}(\varphi)$, the set of free variables of $\varphi$ by $\mathrm{free}(\varphi)$. With every formula $\varphi$ we
associate a hypergraph $\mathcal{H}_\varphi := (\mathrm{var}(\varphi), \{\mathrm{var}(\alpha) \mid \alpha \in \mathrm{at}(\varphi)\})$. A tree-decomposition
of a formula $\varphi$ is a tree-decomposition of $\mathcal{H}_\varphi$. A tree-decomposition $(T, (X_t)_{t \in T})$ of a
formula $\varphi$ is strict if there exists a $t \in T$ such that $\mathrm{free}(\varphi) \subseteq X_t$. Tree-decompositions
turn out to be quite useful when it comes to evaluating formulas of a very simple form,
which are known as conjunctive queries.



Acyclic Conjunctive Queries. A conjunctive query is a first-order formula of the form
$\exists y_1 \ldots \exists y_m \bigwedge_{i=1}^{n} \alpha_i$ with atomic formulas $\alpha_1, \ldots, \alpha_n$. In this subsubsection we explain
an algorithm due to Yannakakis for evaluating acyclic conjunctive queries.

We have to recall some basic notions of relational database theory. An $X$-relation
$\mathcal{R}$, for a finite set $X$, is a finite set of mappings with domain $X$. We let
$\mathrm{range}(\mathcal{R}) := \bigcup_{\gamma \in \mathcal{R}} \gamma(X)$. We think of an $X$-relation as an $|X|$-ary relation on $\mathrm{range}(\mathcal{R})$ in which we
have associated a name (an element of $X$) with every place of the relation. Usually, $X$ is a
set of variables and $\mathrm{range}(\mathcal{R})$ is contained in the universe of some structure. For $Y \subseteq X$,
the $Y$-projection of an $X$-relation $\mathcal{R}$ is the set $\pi_Y(\mathcal{R}) := \{\gamma|_Y \mid \gamma \in \mathcal{R}\}$. For sets $X, Y$
of variables, the join of an $X$-relation $\mathcal{R}$ and a $Y$-relation $\mathcal{S}$ is the $X \cup Y$-relation
$\mathcal{R} \bowtie \mathcal{S} := \{\gamma : X \cup Y \to A \mid \gamma|_X \in \mathcal{R}, \gamma|_Y \in \mathcal{S}\}$. For every formula $\varphi(x_1, \ldots, x_l)$
and structure $\mathcal{A}$ the set $\varphi(\mathcal{A})$ is an $\{x_1, \ldots, x_l\}$-relation over $A$. On the logical level,
projections correspond to existential quantifications and joins correspond to conjunctions.
In particular, for a conjunctive query $\varphi(x_1, \ldots, x_l) = \exists y_1 \ldots \exists y_m \bigwedge_{i=1}^{n} \alpha_i$ we
have $\varphi(\mathcal{A}) = \pi_{\{x_1, \ldots, x_l\}}(\alpha_1(\mathcal{A}) \bowtie \cdots \bowtie \alpha_n(\mathcal{A}))$. The following two lemmas describe
Yannakakis's basic algorithm. The idea behind these lemmas is the following: We want
to evaluate a conjunctive query $\varphi$ in a structure $\mathcal{A}$. Suppose we can efficiently compute
a tree-decomposition $(T, (X_t)_{t \in T})$ of $\varphi$ and, for every $t \in T$, the $X_t$-relation

$\mathcal{P}_t := \bowtie_{\alpha \in \mathrm{at}(\varphi),\ \mathrm{var}(\alpha) \subseteq X_t} \alpha(\mathcal{A}).$

Then, noting that $\varphi(\mathcal{A}) = \pi_{\mathrm{free}(\varphi)}(\bowtie_{t \in T} \mathcal{P}_t)$, we can use Lemma 4.2 to compute $\varphi(\mathcal{A})$.

Moreover, if the tree-decomposition is strict, i.e. if $\mathrm{free}(\varphi) \subseteq X_t$ for some $t \in T$, we
can even do better using Lemma 4.1.
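
Projection and join on $X$-relations can be made concrete with a small sketch. The encoding is an assumption for illustration: a mapping with domain $X$ is a `frozenset` of `(variable, value)` pairs, and `atom` builds the relation of an atom $R x_1 \ldots x_r$ from the tuples of $R$. The join is a plain nested loop, not the efficient join that the running-time bounds rely on.

```python
def project(relation, Y):
    """Y-projection: restrict every mapping to the variables in Y."""
    return {frozenset((x, a) for x, a in row if x in Y) for row in relation}

def join(R, S):
    """Join of an X-relation R and a Y-relation S (naive nested loops)."""
    result = set()
    for r in R:
        r_map = dict(r)
        for s in S:
            # r and s are compatible if they agree on all shared variables
            if all(r_map.get(x, a) == a for x, a in s):
                result.add(r | s)
    return result

def atom(variables, tuples):
    """The relation of an atom R x_1 ... x_r, given the tuples of R."""
    rows = set()
    for tup in tuples:
        row = {}
        # a repeated variable must receive the same value at every place
        if all(row.setdefault(x, a) == a for x, a in zip(variables, tup)):
            rows.add(frozenset(row.items()))
    return rows

# Illustrative query: phi(x, z) = exists y (E x y and E y z) on a directed 3-cycle.
E = {(1, 2), (2, 3), (3, 1)}
answers = project(join(atom(("x", "y"), E), atom(("y", "z"), E)), {"x", "z"})
```

The answer relation pairs every vertex with the vertex two steps ahead on the cycle, matching the reading of projection as existential quantification and join as conjunction.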

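Lemma 4.1. There is an algorithm solving the following problem in time
$O(|T| \cdot \max_{t \in T} \|\mathcal{P}_t\|)$:

Input: Tree-decomposition $(T, (X_t)_{t \in T})$, an $X_t$-relation $\mathcal{P}_t$ for
every $t \in T$.

Problem: Compute $\mathcal{R}_t := \pi_{X_t}(\bowtie_{u \in T} \mathcal{P}_u)$ for all $t \in T$.

Proof: The algorithm passes the tree twice.

(1) Bottom-up. For every $t \in T$ we let $\mathcal{Q}_t := \pi_{X_t}(\bowtie_{u \in T_t} \mathcal{P}_u)$. Then if $t$ is a leaf, we
have $\mathcal{Q}_t = \mathcal{P}_t$, and if $t$ is an inner node with children $t_1, \ldots, t_m$, we have

$\mathcal{Q}_t = \pi_{X_t}(\mathcal{P}_t \bowtie \bowtie_{i=1}^{m} \mathcal{Q}_{t_i}) = \mathcal{P}_t \bowtie \pi_{X_t}(\bowtie_{i=1}^{m} \mathcal{Q}_{t_i}) = \mathcal{P}_t \bowtie \bowtie_{i=1}^{m} \pi_{X_t}(\mathcal{Q}_{t_i}).$

(2) Top-down. Note that for the root $r := r^T$ we have $\mathcal{R}_r = \mathcal{Q}_r$. For a node
$t \in T \setminus \{r\}$ with parent $s$ we have $\mathcal{R}_t = \pi_{X_t}(\mathcal{R}_s) \bowtie \mathcal{Q}_t$.

Thus to compute $\mathcal{Q}_t$ for all $t \in T$ amounts to computing at most one projection and
one join of relations of size at most $\max_{t \in T} \|\mathcal{P}_t\|$ for every tree-node. The same applies
to $\mathcal{R}_t$. □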



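Lemma 4.2. There is an algorithm solving the following problem in time
$O(|T| \cdot \max_{t \in T} \|\mathcal{P}_t\| \cdot \|\mathcal{S}\|)$:

Input: Tree-decomposition $(T, (X_t)_{t \in T})$, a subset $X \subseteq \bigcup_{t \in T} X_t$, an
$X_t$-relation $\mathcal{P}_t$ for every $t \in T$.

Problem: Compute $\mathcal{S} := \pi_X(\bowtie_{t \in T} \mathcal{P}_t)$.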



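Proof: We first compute the family $(\mathcal{R}_t)_{t \in T}$ as in Lemma 4.1. For every $t \in T$, let
$Y_t := X \cap (\bigcup_{u \in T_t} X_u)$. If $t$ has a parent $s$, then let $Z_t := X_t \cap X_s$, and let $Z_r := \emptyset$.

Let $\mathcal{S}_t := \pi_{Y_t \cup Z_t}(\bowtie_{u \in T_t} \mathcal{R}_u)$. Then $\mathcal{S} = \mathcal{S}_r$.

We can compute the relations $\mathcal{S}_t$ inductively in a bottom-up manner, noting that for
a leaf $t$ we have $\mathcal{S}_t = \pi_{Y_t \cup Z_t}(\mathcal{R}_t)$ and for an inner node $t$ with children $t_1, \ldots, t_m$ we
have

$\mathcal{S}_t = \pi_{Y_t \cup Z_t}(\mathcal{R}_t \bowtie \mathcal{S}_{t_1} \bowtie \cdots \bowtie \mathcal{S}_{t_m}).$ □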
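
The passes of Lemmas 4.1 and 4.2 can be sketched together. The encoding is hypothetical (rows are `frozenset`s of `(variable, value)` pairs, `children` maps a node to its children, `X` and `P` map nodes to variable sets and relations), the joins are naive nested loops, and the bottom-up pass of the second function works on the reduced relations $\mathcal{R}_t$, which is our reading of the partly illegible proof.

```python
def project(rel, Y):
    return {frozenset((x, a) for x, a in row if x in Y) for row in rel}

def join(R, S):
    out = set()
    for r in R:
        r_map = dict(r)
        for s in S:
            if all(r_map.get(x, a) == a for x, a in s):
                out.add(r | s)
    return out

def full_reduce(children, root, X, P):
    """Lemma 4.1 sketch: R[t] = projection of the join of all P[u] onto X[t]."""
    order, stack = [], [root]            # parents come before their children
    while stack:
        t = stack.pop()
        order.append(t)
        stack.extend(children.get(t, ()))
    Q = {}
    for t in reversed(order):            # (1) bottom-up
        Q[t] = P[t]
        for s in children.get(t, ()):
            Q[t] = join(Q[t], project(Q[s], X[t]))
    R = {root: Q[root]}
    for t in order:                      # (2) top-down
        for s in children.get(t, ()):
            R[s] = join(project(R[t], X[s]), Q[s])
    return R, order

def project_join(children, root, X, P, out_vars):
    """Lemma 4.2 sketch: projection of the join of all P[t] onto out_vars."""
    R, order = full_reduce(children, root, X, P)
    parent = {s: t for t in order for s in children.get(t, ())}
    Y, S = {}, {}
    for t in reversed(order):
        Y[t] = out_vars & X[t]
        rel = R[t]
        for s in children.get(t, ()):
            Y[t] |= Y[s]
            rel = join(rel, S[s])
        Z = X[t] & X[parent[t]] if t in parent else set()
        S[t] = project(rel, Y[t] | Z)
    return S[root]

# Illustrative query: exists y, z (E x y and E y z and E z w), free variables x, w.
E = {(1, 2), (2, 3), (3, 4), (5, 6)}
def rows(v1, v2):
    return {frozenset({(v1, a), (v2, b)}) for a, b in E}
children = {"t2": ("t1", "t3")}
X = {"t1": {"x", "y"}, "t2": {"y", "z"}, "t3": {"z", "w"}}
P = {"t1": rows("x", "y"), "t2": rows("y", "z"), "t3": rows("z", "w")}
reduced, _ = full_reduce(children, "t2", X, P)
answers = project_join(children, "t2", X, P, {"x", "w"})
```

After the two passes of `full_reduce` every remaining row takes part in some answer, which is why the intermediate relations in `project_join` stay bounded in terms of the input and the output.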

A hypergraph $\mathcal{H}$ is acyclic if there is a tree-decomposition $(T, (H_t)_{t \in T})$ of $\mathcal{H}$ such
that for every $t \in T$ there exists a hyperedge $e \in E^{\mathcal{H}}$ such that $e = H_t$. We call such a
tree-decomposition a chordal decomposition of $\mathcal{H}$. A conjunctive query $\varphi$ is acyclic if
its hypergraph $\mathcal{H}_\varphi$ is. $\varphi$ is strictly acyclic if $\mathcal{H}_\varphi$ has a chordal decomposition that is a
strict tree-decomposition of $\varphi$.

Theorem 4.3 (Tarjan, Yannakakis [18]). Given a hypergraph $\mathcal{H}$, it can be decided in
linear time if $\mathcal{H}$ is acyclic. If this is the case, a reduced and chordal decomposition of
$\mathcal{H}$ can be computed in linear time.
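
For experimenting with the notion of acyclicity, a much simpler test than the linear-time algorithm of [18] suffices. The sketch below uses the classical GYO reduction (repeatedly delete vertices that occur in only one hyperedge and hyperedges contained in another one); it is a swapped-in technique, not the algorithm of Theorem 4.3, it runs in polynomial rather than linear time, and it does not output a decomposition.

```python
def is_acyclic(hyperedges):
    """GYO reduction: True iff the hypergraph reduces to nothing, i.e. is acyclic."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Rule 1: delete every vertex that occurs in exactly one hyperedge.
        for e in edges:
            lonely = {v for v in e if sum(v in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: delete one hyperedge that is empty or contained in another one.
        for i, e in enumerate(edges):
            if not e or any(i != j and e.issubset(f) for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return not edges
```

For instance, a single hyperedge and a path of binary edges are accepted, while the 4-cycle `[{1, 2}, {2, 3}, {3, 4}, {4, 1}]` is rejected; adding the hyperedge `{1, 2, 3}` to a triangle makes it acyclic again.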

Theorem 4.4 (Yannakakis [24]). There is an algorithm that solves the evaluation problem
for acyclic conjunctive queries in time $O(\|\varphi\| \cdot \|\mathcal{A}\| \cdot \|\varphi(\mathcal{A})\|)$.

For strictly acyclic conjunctive queries, this can be improved to $O(\|\varphi\| \cdot \|\mathcal{A}\|)$.

Proof: Given $\mathcal{A}$ and $\varphi$, we first compute a reduced chordal decomposition $(T, (X_t)_{t \in T})$
of $\varphi$ in time $O(\|\varphi\|)$.

For every atomic formula $\alpha \in \mathrm{at}(\varphi)$, we let $\mathrm{node}(\alpha)$ be the smallest $t \in T$ (with
respect to $\le^T$) such that $\mathrm{var}(\alpha) \subseteq X_t$. It can be shown that $\mathrm{node}(\alpha)$ can be computed
in time $O(\|\varphi\|)$. Furthermore, $\alpha(\mathcal{A})$ can be computed in time $O(\|\mathcal{A}\|)$. For $t \in T$ we
let

$\mathcal{P}_t := \bowtie_{\alpha \in \mathrm{at}(\varphi),\ \mathrm{node}(\alpha) = t} \alpha(\mathcal{A}).$

Then $\varphi(\mathcal{A}) = \pi_{\mathrm{free}(\varphi)}(\bowtie_{t \in T} \mathcal{P}_t)$. For $t \in T$ we let $\alpha_t \in \mathrm{at}(\varphi)$ such that $\mathrm{var}(\alpha_t) = X_t$.

Then $\mathcal{P}_t \subseteq \alpha_t(\mathcal{A})$, thus $\|\mathcal{P}_t\| \le \|\alpha_t(\mathcal{A})\| \le \|\mathcal{A}\|$. We can compute the family $(\mathcal{P}_t)_{t \in T}$
in time $O(\|\varphi\| \cdot \|\mathcal{A}\|)$, because we only have to compute at most one join per atom of
$\varphi$, and if these joins are computed in the right order, the intermediate relations obtained
while computing $\mathcal{P}_t$ are all contained in $\alpha_t(\mathcal{A})$. We now apply Lemma 4.2. The stronger
statement for strictly acyclic $\varphi$ follows from Lemma 4.1. □

The tree-width $\mathrm{tw}(\varphi)$ of a formula $\varphi$ is the tree-width of $\mathcal{H}_\varphi$. The strict tree-width
$\mathrm{stw}(\varphi)$ of $\varphi$ is the minimum width over all possible strict tree-decompositions of $\varphi$. By
a similar proof we get:

Theorem 4.5 (Chekuri and Rajaraman [6]). The evaluation problem for conjunctive
queries can be solved in time $O(2^{p(w)} \cdot \|\varphi\| + \|\varphi\| \cdot (|A|^{w+1} + \|\mathcal{A}\|) \cdot \|\varphi(\mathcal{A})\|)$, where
$w := \mathrm{tw}(\varphi)$, and in time $O(2^{p(s)} \cdot \|\varphi\| + \|\varphi\| \cdot (|A|^{s+1} + \|\mathcal{A}\|))$, where $s := \mathrm{stw}(\varphi)$.

Remark 4.6. Recall Remark 3.7, where we evaluated a query by first translating the
input structure $\mathcal{A}$ to a bipartite structure $\mathcal{A}_b$ and the input formula $\varphi$ to a formula $\varphi_b$,
with $\varphi(\mathcal{A}) = \varphi_b(\mathcal{A}_b)$. A similar approach yields a variant of Theorem 4.5 which is
sometimes better. Chekuri and Rajaraman [6] took this approach.

Remark 4.7. Acyclicity and bounded tree-width of a conjunctive query are incomparable:
all queries $E x_1 x_2 \wedge E x_2 x_3 \wedge \ldots \wedge E x_{n-1} x_n \wedge E x_n x_1$, for $n \ge 3$, are of
tree-width 2, but cyclic. On the other hand, the queries $R_n x_1 \ldots x_n$, for $n \ge 1$ and $n$-ary
$R_n$, are acyclic, but their tree-width grows with $n$. The connection between acyclicity
and bounded tree-width has also been discussed in the full version of [15].

Chekuri and Rajaraman [6] defined a common generalization of both acyclicity and
tree-width they called query-width. Acyclic queries are precisely those of query-width
1, and for all queries $\varphi$ we have query-width$(\varphi) \le \mathrm{tw}(\varphi) + 1$. Chekuri and Rajaraman
[6] showed that conjunctive queries of bounded query-width can be evaluated in polynomial
time (in the size of the input and the output). However, query-width has one
big disadvantage: Queries of bounded query-width cannot be recognized in polynomial
time. More precisely, for every $q \ge 4$ it is NP-complete to decide whether a
given query has query-width at most $q$ [13]. Gottlob, Leone, and Scarcello [13] therefore
introduced yet another width, the hypertree-width of a conjunctive query. Acyclic
queries are precisely the queries of hypertree-width 1, and for every query $\varphi$ we have
hypertree-width$(\varphi) \le$ query-width$(\varphi)$. Moreover, for every $h \ge 1$ there is a polynomial
time algorithm that recognizes the queries of hypertree-width at most $h$.

Although we do not want to give the details here, the basic idea of query-width
and hypertree-width is easy to explain: Suppose that we are given a tree-decomposition
$(T, (X_t)_{t \in T})$ of a query $\varphi$, and a mapping $\lambda$ that associates with every $t \in T$ a set
$\lambda(t)$ of atoms of $\varphi$ such that $X_t \subseteq \bigcup_{\alpha \in \lambda(t)} \mathrm{var}(\alpha)$. Suppose, moreover, that $|\lambda(t)| \le h$
for all $t \in T$ (very roughly, the hypertree-width of $\varphi$ is the smallest $h$ such that a
tree-decomposition and a mapping $\lambda$ with these properties exist). Then to evaluate $\varphi$ in a
structure $\mathcal{A}$, we can proceed as follows: For $t \in T$ we let

$\mathcal{P}_t := \bowtie_{\alpha \in \lambda(t)} \pi_{X_t}(\alpha(\mathcal{A})) \bowtie \bowtie_{\alpha \in \mathrm{at}(\varphi),\ \mathrm{node}(\alpha) = t} \alpha(\mathcal{A}).$

Then $\varphi(\mathcal{A}) = \pi_{\mathrm{free}(\varphi)}(\bowtie_{t \in T} \mathcal{P}_t)$. Thus we can use Lemmas 4.1 and 4.2 to compute $\varphi(\mathcal{A})$.

Tree-decompositions and related notions have also been studied extensively in AI;
[14] is a survey.

Let us now try to extend the results on conjunctive queries to larger classes of formulas. A
conjunctive query with negation is a formula of the form $\exists \bar y \bigwedge_{i=1}^{n} \lambda_i$, where $\lambda_1, \ldots, \lambda_n$
are literals, i.e. atomic or negated atomic formulas. We let $\mathrm{at}^+(\varphi)$ ($\mathrm{at}^-(\varphi)$) denote the
set of all atoms occurring positively (negatively, resp.) in $\varphi$.

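The first observation we make is a disappointment: Evaluating conjunctive queries
with negation whose hypergraph is acyclic is just as hard as evaluating arbitrary conjunctive
queries with negation. To see this, let $\varphi = \exists \bar y \bigwedge_{i=1}^{n} \lambda_i$ be an arbitrary conjunctive
query with negation of vocabulary $\tau$. Let $R$ be a new $|\mathrm{var}(\varphi)|$-ary relation symbol, $\bar x$
a tuple that contains all variables of $\varphi$, and $\varphi^* := \exists \bar y (\neg R \bar x \wedge \bigwedge_{i=1}^{n} \lambda_i)$. Then $\mathcal{H}_{\varphi^*}$
is acyclic, because the one node tree yields a chordal tree-decomposition of $\mathcal{H}_{\varphi^*}$. For
every $\tau$-structure $\mathcal{A}$ let $\mathcal{A}^*$ be the $\tau \cup \{R\}$-expansion of $\mathcal{A}$ with $R^{\mathcal{A}^*} := \emptyset$. Then
$\|\mathcal{A}^*\| = O(\|\mathcal{A}\|)$ and $\varphi(\mathcal{A}) = \varphi^*(\mathcal{A}^*)$. Thus evaluating $\varphi$ is not harder than evaluating
$\varphi^*$.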

However, there is a refined notion of acyclicity for conjunctive queries with negation:
A conjunctive query with negation $\varphi$ is acyclic if it has a chordal decomposition
$(T, (X_t)_{t \in T})$ such that for every $t \in T$ there is an atom $\alpha \in \mathrm{at}^+(\varphi)$ with $X_t = \mathrm{var}(\alpha)$.
If $\varphi$ has such a chordal decomposition that, in addition, is a strict tree-decomposition,
then $\varphi$ is strictly acyclic.

It is now easy to extend the previous results: 

Corollary 4.8. The evaluation problem for acyclic conjunctive queries with negation
can be solved in time $O(\|\varphi\| \cdot \|\mathcal{A}\| \cdot \|\varphi(\mathcal{A})\|)$. For strictly acyclic queries, this can be
improved to $O(\|\varphi\| \cdot \|\mathcal{A}\|)$.

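Corollary 4.9. The evaluation problem for conjunctive queries with negation can be
solved in time $O(2^{p(w)} \cdot \|\varphi\| + \|\varphi\| \cdot (|A|^{w+1} + \|\mathcal{A}\|) \cdot \|\varphi(\mathcal{A})\|)$, where $w := \mathrm{tw}(\varphi)$,
and in time $O(2^{p(s)} \cdot \|\varphi\| + \|\varphi\| \cdot (|A|^{s+1} + \|\mathcal{A}\|))$, where $s := \mathrm{stw}(\varphi)$.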



Non-recursive Stratified Datalog. A datalog rule (with negation) $\rho$ is an expression
of the form $\gamma \leftarrow \lambda_1 \wedge \ldots \wedge \lambda_n$, where $\gamma$ is an atom of the form $Q x_1 \ldots x_l$ with pairwise
distinct variables $x_1, \ldots, x_l \in \mathrm{var}(\lambda_1 \wedge \ldots \wedge \lambda_n)$ and $\lambda_1, \ldots, \lambda_n$ are literals. $\gamma$ is called
the head and $\lambda_1 \wedge \ldots \wedge \lambda_n$ the body of $\rho$.

To define the semantics, suppose that $\gamma = Q x_1 \ldots x_l$ and let $\mathcal{A}$ be a structure whose
vocabulary contains all relation symbols occurring in the body of $\rho$. Then we let

p{A) := • 
If $\bar y$ is a tuple that consists of all variables in $\bigcup_{1 \le i \le n} \mathrm{var}(\lambda_i) \setminus \{x_1, \ldots, x_l\}$ and
$\varphi(\bar x) := \exists \bar y \bigwedge_{1 \le i \le n} \lambda_i$, then $\rho(\mathcal{A}) = \varphi(\mathcal{A})$. Thus datalog rules are just another way of writing
conjunctive queries with negation.

A datalog program is a finite set $\Pi$ of datalog rules. The intensional vocabulary
$\mathrm{int}(\Pi)$ of $\Pi$ is the set of all relation symbols that occur in the head of some rule of $\Pi$.
The extensional vocabulary $\mathrm{ext}(\Pi)$ is the set of all relation symbols that occur in the
body of some rule of $\Pi$ and are not contained in $\mathrm{int}(\Pi)$. Program $\Pi$ is non-recursive,
if no relation symbol that occurs in the head of a rule also occurs in the body of a rule.
In the following we restrict our attention to non-recursive datalog programs.

To define the semantics, let $Q \in \mathrm{int}(\Pi)$ and $\mathcal{A}$ an $\mathrm{ext}(\Pi)$-structure. Then we let

$\Pi_Q(\mathcal{A}) := \bigcup \{\rho(\mathcal{A}) \mid \rho \in \Pi,\ Q$ occurs in the head of $\rho\}.$

The query $\mathcal{A} \mapsto \Pi_Q(\mathcal{A})$ can be defined by an existential first-order formula; conversely
every query definable by an existential first-order formula can also be defined by a
datalog program.

A non-recursive stratified datalog (NRSD) program is a sequence $\Pi := (\Pi_1, \ldots, \Pi_n)$
of non-recursive datalog programs with the property that no $Q \in \mathrm{int}(\Pi_j)$ occurs in $\Pi_i$
for $1 \le i < j \le n$. We set $\mathrm{int}(\Pi) := \bigcup_{i=1}^{n} \mathrm{int}(\Pi_i)$, and
$\mathrm{ext}(\Pi) := \bigcup_{i=1}^{n} (\mathrm{ext}(\Pi_i) \setminus \bigcup_{j=1}^{n} \mathrm{int}(\Pi_j))$. The programs $\Pi_1, \ldots, \Pi_n$ are called the strata of $\Pi$.

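To define the semantics, let $\mathcal{A}$ be an $\mathrm{ext}(\Pi)$-structure. We let $\mathcal{A}_0 := \mathcal{A}$. Inductively
over $i$, $1 \le i \le n$, we define $\mathcal{A}_i$ and $\Pi_Q(\mathcal{A})$ for all $Q \in \mathrm{int}(\Pi_i)$ as follows: Suppose
that $1 \le i \le n$ and that the $\mathrm{ext}(\Pi) \cup \bigcup_{j=1}^{i-1} \mathrm{int}(\Pi_j)$-structure $\mathcal{A}_{i-1}$ is already defined.
For $Q \in \mathrm{int}(\Pi_i)$ we let $\Pi_Q(\mathcal{A}) := (\Pi_i)_Q(\mathcal{A}_{i-1})$. Let $\mathcal{A}_i$ be the $\mathrm{ext}(\Pi) \cup \bigcup_{j=1}^{i} \mathrm{int}(\Pi_j)$-expansion
of $\mathcal{A}_{i-1}$ with $Q^{\mathcal{A}_i} := \Pi_Q(\mathcal{A})$ for $Q \in \mathrm{int}(\Pi_i)$.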
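
The stratum-by-stratum semantics can be sketched with a deliberately naive evaluator. The rule encoding is an assumption made for illustration: a rule is a dictionary with a `"head"` (predicate name and variable tuple) and a `"body"` of literals `(positive, predicate, variables)`. Each rule is evaluated by trying every assignment of its body variables, so this sketch ignores all the efficiency considerations of this section.

```python
from itertools import product

def eval_rule(rule, universe, facts):
    """Naive semantics of one rule: try every assignment of the body variables."""
    head_vars, body = rule["head"][1], rule["body"]
    variables = sorted({x for _, _, args in body for x in args})
    result = set()
    for values in product(list(universe), repeat=len(variables)):
        env = dict(zip(variables, values))
        # a literal holds if membership of its tuple matches its sign
        if all((tuple(env[x] for x in args) in facts.get(pred, set())) == positive
               for positive, pred, args in body):
            result.add(tuple(env[x] for x in head_vars))
    return result

def eval_nrsd(strata, universe, facts):
    """Evaluate the strata in order; each stratum only sees the previous expansion."""
    facts = {pred: set(tuples) for pred, tuples in facts.items()}
    for stratum in strata:
        new = {}
        for rule in stratum:
            new.setdefault(rule["head"][0], set()).update(
                eval_rule(rule, universe, facts))
        facts.update(new)  # expansion by the intensional relations of this stratum
    return facts

# Illustrative program: T = paths of length two, Q = such pairs that are not edges.
strata = [
    [{"head": ("T", ("x", "z")),
      "body": [(True, "E", ("x", "y")), (True, "E", ("y", "z"))]}],
    [{"head": ("Q", ("x", "z")),
      "body": [(True, "T", ("x", "z")), (False, "E", ("x", "z"))]}],
]
out = eval_nrsd(strata, {1, 2, 3}, {"E": {(1, 2), (2, 3)}})
```

All rules of one stratum are evaluated against the structure produced by the earlier strata, which mirrors the inductive definition of the structures $\mathcal{A}_i$.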

In the following, we assume that every NRSD-program has a distinguished intensional
relation symbol, the goal predicate, which we always denote by $Q$. Then we
write $\Pi(\mathcal{A})$ instead of $\Pi_Q(\mathcal{A})$; $\mathcal{A} \mapsto \Pi_Q(\mathcal{A})$ is the query defined by $\Pi$. An NRSD-program
$\Pi$ is Boolean, if it defines a Boolean query, i.e. if its goal predicate is 0-ary. An
NRSD-program $\Pi$ is equivalent to a formula $\varphi$ or to another program $\Pi'$ if they define
the same query. It is easy to prove the following (well-known) fact:

Fact 4.10. A query is first-order definable if and only if it is NRSD-definable. 

The evaluation problem for a class $P$ of datalog programs is the following problem:

Input: Structure $\mathcal{A}$, program $\Pi \in P$.
Problem: Compute $\Pi(\mathcal{A})$.



Tree-decompositions, tree-width, acyclicity, etc. of a datalog rule $\rho$ are defined with
respect to the corresponding conjunctive query with negation. For example, a tree-decomposition
of a datalog rule $\rho := \gamma \leftarrow \bigwedge_{i=1}^{n} \lambda_i$ is a tree-decomposition of the
hypergraph $(\mathrm{var}(\rho), \{\mathrm{var}(\lambda_i) \mid 1 \le i \le n\})$, and a tree-decomposition $(T, (X_t)_{t \in T})$ of
$\rho$ is strict if there is a $t \in T$ such that $\mathrm{var}(\gamma) \subseteq X_t$.

Definition 4.11. Let $\Pi = (\Pi_1, \ldots, \Pi_n)$ be an NRSD-program.

(1) $\Pi$ is strictly acyclic if every rule of $\Pi$ is strictly acyclic.

(2) $\Pi$ is acyclic if $(\Pi_1, \ldots, \Pi_{n-1})$ is strictly acyclic and every rule $\rho$ of $\Pi_n$ is acyclic.

(3) The strict tree-width of $\Pi$ is the number $\mathrm{stw}(\Pi) := \max\{\mathrm{stw}(\rho) \mid \rho \in \bigcup_{i=1}^{n} \Pi_i\}$.

(4) The tree-width of $\Pi$ is the number
$\mathrm{tw}(\Pi) := \max(\{\mathrm{stw}((\Pi_1, \ldots, \Pi_{n-1}))\} \cup \{\mathrm{tw}(\rho) \mid \rho \in \Pi_n\})$.

The following example shows why it is necessary that in an acyclic program we 
require all strata except for the last one to be strictly acyclic: 

Example 4.12. Let $\Pi = (\Pi_1, \ldots, \Pi_n)$ be an NRSD-program. We construct an equivalent
program $\Pi' = (\Pi'_0, \Pi'_1, \ldots, \Pi'_n)$ such that all rules of $\Pi'$ are acyclic. Let $m$ be
the maximal number of variables occurring in a rule of $\Pi$ and $X \notin \mathrm{int}(\Pi) \cup \mathrm{ext}(\Pi)$
a new $m$-ary relation symbol. Let $\rho_0$ be the acyclic datalog rule
$X x_1 \ldots x_m \leftarrow x_1 = x_1 \wedge \ldots \wedge x_m = x_m$. $\Pi'_0$ just consists of the rule $\rho_0$, and for $1 \le i \le n$ the stratum $\Pi'_i$
is obtained from $\Pi_i$ by adding an atom $X \bar y$ to the body of every rule $\rho$ of $\Pi_i$, where $\bar y$ is
a tuple of variables that contains all variables of $\rho$. Clearly, $\Pi'$ is equivalent to $\Pi$, and
all of its rules are acyclic.

An NRSD-program $\Pi = (\Pi_1, \ldots, \Pi_n)$ is in normal form if for $1 \le i < j \le n$ and
$X \in \mathrm{int}(\Pi_i) \cap \mathrm{ext}(\Pi_j)$ the relation symbol $X$ only occurs negatively in $\Pi_j$.

Lemma 4.13. There is an exponential time algorithm that associates with every acyclic
NRSD-program $\Pi$ an equivalent acyclic NRSD-program $\Pi'$ in normal form. Furthermore,
if $\Pi$ is strictly acyclic, then $\Pi'$ is also strictly acyclic.

The results of Section 4 on conjunctive queries yield: 

Corollary 4.14. (1) The evaluation problem for acyclic NRSD-programs in normal
form can be solved in time $O(\|\Pi\| \cdot \|\mathcal{A}\| \cdot \|\Pi(\mathcal{A})\|)$. For strictly acyclic NRSD-programs
in normal form, this can be improved to $O(\|\Pi\| \cdot \|\mathcal{A}\|)$.

(2) The evaluation problem for NRSD-programs can be solved in time
$O(2^{p(w)} \cdot \|\Pi\| + \|\Pi\| \cdot (|A|^{w+1} + \|\mathcal{A}\|) \|\Pi(\mathcal{A})\|)$, where $w := \mathrm{tw}(\Pi)$, and in time
$O(2^{p(s)} \cdot \|\Pi\| + \|\Pi\| \cdot (|A|^{s+1} + \|\mathcal{A}\|))$, where $s := \mathrm{stw}(\Pi)$.

The following example shows why we need the normal form in (1) and that there is 
no polynomial time translation of arbitrary acyclic programs into normal form programs. 
A similar example occurs in [11]. 

Example 4.15. Let $n \ge 2$ and $X_1, \ldots, X_n$ be $n$-ary relation symbols. For $1 \le i < j \le n$,
$1 \le k \le n - 1$ let $\rho^k_{ij}$ be the datalog rule

$X_{k+1} x_1 \ldots x_{i-1} x_j x_{i+1} \ldots x_{j-1} x_i x_{j+1} \ldots x_n \leftarrow X_k x_1 \ldots x_n$

and let $\rho^k_0$ be the rule $X_{k+1} \bar x \leftarrow X_k \bar x$. Set $\Pi_k := \{\rho^k_0\} \cup \{\rho^k_{ij} \mid 1 \le i < j \le n\}$.

Then $\Pi = (\Pi_1, \ldots, \Pi_{n-1})$ is a strictly acyclic NRSD-program with $\|\Pi\| = O(n^4)$.
For the $\{X_1\}$-structure $\mathcal{A} := (\{1, \ldots, n\}, \{(1, \ldots, n)\})$ we have $|\Pi_{X_n}(\mathcal{A})| = n!$, and
therefore $\Pi_{X_n}(\mathcal{A})$ cannot be computed in time $O(\|\Pi\| \cdot \|\mathcal{A}\|) = O(n^5)$.
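
The blow-up in this example is easy to reproduce. The sketch below simulates the strata under our reading of the (partly illegible) rules: each stratum either copies a tuple or swaps two of its places. The function name is hypothetical.

```python
from itertools import combinations

def strata_output(n):
    """Tuples in X_n after the n-1 strata, starting from X_1 = {(1, ..., n)}."""
    relation = {tuple(range(1, n + 1))}
    for _ in range(n - 1):
        nxt = set(relation)                          # the copy rule keeps every tuple
        for tup in relation:
            for i, j in combinations(range(n), 2):   # one rule per pair of places
                swapped = list(tup)
                swapped[i], swapped[j] = swapped[j], swapped[i]
                nxt.add(tuple(swapped))
        relation = nxt
    return relation
```

Since every permutation of $n$ elements is a product of at most $n - 1$ transpositions, the final relation contains all $n!$ permutations of $(1, \ldots, n)$; for example, `len(strata_output(4))` is 24.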

Similar results can be obtained for stratified datalog programs with recursion (by
iterating the results for programs without recursion).

Tractable Fragments of First-Order Logic. The classes of strictly acyclic NRSD-programs
and programs of strictly bounded tree-width correspond to well-known fragments
of first-order logic.

The guarded fragment GF is the smallest fragment of FO containing all atoms, closed
under the Boolean operations $\neg, \wedge, \vee$ and satisfying:

If $\alpha$ is atomic and $\varphi$ is a GF-formula with $\mathrm{free}(\varphi) \subseteq \mathrm{var}(\alpha)$, then for every
tuple $\bar y$ of variables $\exists \bar y (\alpha \wedge \varphi)$ is a GF-formula.

A GF-formula is strictly guarded if it is of the form $\exists \bar y (\alpha \wedge \varphi)$, where $\mathrm{free}(\varphi) \subseteq \mathrm{var}(\alpha)$.
Here we allow the degenerate case that $\bar y$ is empty, i.e. that the formula is just
$\alpha \wedge \varphi$. Every GF-formula is a Boolean combination of atomic formulas and strictly
guarded formulas. Furthermore, any GF-sentence $\varphi$ is equivalent to the strictly guarded
sentence $\exists y (y = y \wedge \varphi)$.

Theorem 4.16. (1) There is a quadratic time algorithm that associates with every strictly
guarded formula an equivalent strictly acyclic NRSD-program in normal form.
(2) There is an algorithm that associates with every strictly acyclic NRSD-program an
equivalent disjunction of strictly guarded formulas.

In particular, every GF-sentence is equivalent to an acyclic Boolean NRSD-program
and vice versa.

Remark 4.17. It seems possible to improve the translation algorithm in (1) to an
$O(n \cdot \log(n))$-algorithm. We do not believe that it can be made linear, although we cannot prove
this. The following strictly acyclic formulas $(\varphi_n)_{n \ge 1}$ seem to require NRSD-programs
of superlinear size: We let $P_1, P_2, \ldots$ be unary and, for $n \ge 1$, $R_n$ $n$-ary. Then we let

$\varphi_n := R_n x_1 \ldots x_n \wedge \bigvee_{i=1}^{n} P_i x_i.$

The translation from NRSD-programs to first-order formulas cannot be made poly- 
nomial; this has nothing to do with being acyclic or guarded. Intuitively, it is due to the 
fact that programs correspond to directed acyclic graphs, whereas formulas correspond 
to trees. 

Remark 4.18. Gottlob, Grädel, and Veith [11] give a similar translation between sentences
of the guarded fragment and so-called Datalog LITE-programs, which are essentially
stratified datalog programs where every rule has a strict chordal decomposition whose
tree only consists of one node.

The following theorem generalizes a result for conjunctive queries due to Kolaitis 
and Vardi: 

Theorem 4.19. Let $k \ge 1$ and let FO$^k$ be the set of all first-order formulas with at most
$k$ variables.

(1) There is a linear time algorithm that associates with every FO$^k$-formula an equivalent
NRSD-program of strict tree-width at most $(k - 1)$.

(2) There is an algorithm that associates with every NRSD-program of strict tree-width
at most $(k - 1)$ an equivalent FO$^k$-formula.







J. Flum, M. Frick, and M. Grohe 



5 The Overall Picture 

The query-evaluation algorithms of Sections 3 and 4 are very similar. First we compute
a tree-decomposition, either of the structure or the formula. Then we pass the tree three
times. The first (bottom-up) pass is to compute all reachable states of the automaton at
some tree-node (the sets $P_t$ in the proof of Theorem 3.2) or all reachable assignments
to the variables occurring at the node (the relation $\mathcal{Q}_t$ in the proof of Lemma 4.1).
In the second (top-down) pass we filter out all states that do not lead to an accepting
configuration ($S_t$ in the proof of Theorem 3.2, $\mathcal{R}_t$ in Lemma 4.1). The third (bottom-up)
pass is to assemble the satisfying assignments from the pieces computed at every
node ($\mathrm{Sat}_{t,q}$ in the proof of Theorem 3.2, $\mathcal{S}_t$ in Lemma 4.2). Note that in both cases for
sentences we only need the first pass.

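The connection becomes clearer if in Section 4 we view the structures as automata:
We define a non-deterministic¹ tree-automaton $\mathfrak{A}_{\mathcal{A}} := (Q, \delta, \Delta, F)$ for every $\tau$-structure
$\mathcal{A}$. Let $r$ be the arity of $\tau$. The alphabet is
$\Sigma_\tau := \tau \times \mathrm{Pow}(\{(i, j) \mid 1 \le i, j \le r\}) \times \mathrm{Pow}(\{(i, j) \mid 1 \le i, j \le r\})$.
The state space consists of a state $q_a$ for every $a \in A$ and a state $q^R_{\bar a}$ for every
$R \in \tau$, $\bar a \in R^{\mathcal{A}}$. In the following we just view '$=$' as a relation symbol in $\tau$ with
$=^{\mathcal{A}} := \{(a, a) \mid a \in A\}$. The transition relation $\delta$
consists of all tuples $(q^{R'}_{\bar a'}, q^{R''}_{\bar a''}, (R, e, e'), q^R_{\bar a})$ where $((i, j) \in e \Rightarrow a'_i = a_j)$ and
$((i, j) \in e' \Rightarrow a''_i = a_j)$. The starting relation $\Delta$ consists of all pairs $((R, \emptyset, \emptyset), q^R_{\bar a})$.
Every state of $\mathfrak{A}_{\mathcal{A}}$ is accepting, i.e. we let $F := Q$.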

Now let $\varphi$ be an acyclic conjunctive query of vocabulary $\tau$. Then, starting from an
arbitrary chordal decomposition of $\varphi$, we can find a chordal decomposition $(T, (X_t)_{t \in T})$
of $\varphi$ where $T$ is a binary tree, together with an onto mapping $\lambda : T \to \mathrm{at}(\varphi)$ such
that for every $t \in T$ we have $\mathrm{var}(\lambda(t)) = X_t$. We define a mapping $\sigma : T \to \Sigma_\tau$ as
follows: For the leaves $t \in T$ we let $\sigma(t) = (R, \emptyset, \emptyset)$, where $R$ is the relation symbol
occurring in $\lambda(t)$. For a node $t$ with children $u_1$ and $u_2$ we let $\sigma(t) = (R, e_1, e_2)$,
where again $R$ is the relation symbol occurring in $\lambda(t)$, and $e_i$ is defined as follows (for
$i = 1, 2$): If $\lambda(t) = R x_1 \ldots x_m$ and $\lambda(u_i) = R' y_1 \ldots y_n$ (note that $1 \le m, n \le r$), then
$e_i := \{(k, l) \mid y_k = x_l\}$.

Suppose now that $\varphi$ is a sentence. Then the automaton $\mathfrak{A}_{\mathcal{A}}$ accepts 
the colored tree $(T, \sigma)$ if, and only if, $\mathcal{A} \models \varphi$. The fact that our 
automaton is not deterministic does not play a role as long as we just want to decide 
acceptance; in the algorithm of Theorem 3.2 we used the fact that there we had a 
deterministic automaton only in the third pass in order to compute the output efficiently. 

The correspondence breaks down for formulas with free variables. If we want to 
evaluate an MSO-formula on a structure of bounded tree-width then in the third pass of 
the automaton we have to collect additional colorings of the tree that lead to accepting 
runs. If we want to evaluate an acyclic conjunctive query in a structure $\mathcal{A}$ using the 
automaton $\mathfrak{A}_{\mathcal{A}}$, then in the third pass we have to collect projections of 
accepting runs of the automaton. 

^ In Section 3 we only defined deterministic tree-automata. In a non-deterministic automaton 
$(Q, \delta, \Delta, F)$ of alphabet $\Sigma$, instead of functions we have a transition relation 
$\delta \subseteq Q \times Q \times \Sigma \times Q$ and a starting relation $\Delta \subseteq \Sigma \times Q$. 

If we restrict our attention to monadic second-order sentences of a very special form, 
there is an abstract explanation for the connection between evaluation of the MSO- 
sentences and Boolean conjunctive queries due to Feder and Vardi [9]: Both problems 
amount to deciding whether there is a homomorphism between two relational structures. 
And in both cases we use the fact that it is decidable in polynomial time if an arbitrary 
structure A contains a homomorphic image of a structure B of bounded tree-width. 

Abstract. We consider here scalar aggregation queries in databases that 
may violate a given set of functional dependencies. We show how to 
compute consistent answers (answers true in every minimal repair of 
the database) to such queries. We provide a complete characterization 
of the computational complexity of this problem. We also show how 
tractability can be obtained in several special cases (one involves a novel 
application of perfect graph theory) and present a practical hybrid 
query evaluation method. 



1 Introduction 

While integrity constraints capture important semantic properties of data, they 
are often unenforceable if data comes from different, autonomous sources (thus 
the integrated database may be inconsistent with the constraints) . The notion of 
a consistent query answer [2] attempts to reduce this tension by using constraints 
to qualify query answers. A consistent answer is, intuitively, true regardless of 
the way the database is fixed to remove constraint violations. Thus answer con- 
sistency serves as an indication of its reliability. 

Consistent query answers are potentially important in a datawarehouse con- 
text, where inconsistencies are likely to occur as the effect of the integration 
of data sources, with duplicate information, or delayed refreshment of the wa- 
rehouse. In addition, it is in datawarehousing where aggregation queries are 
particularly important because they are used, in combination with OLAP me- 
thodologies, to better understand, in a global way, the peculiarities of clients, 
market and business behavior, and to support decision making. 

In [2], in addition to a formal definition of a consistent query answer, a 
computational mechanism for obtaining such answers was presented. However, 
the queries considered were just first-order queries. Here we address, in the same 
context, the issue of aggregation queries. However, we limit ourselves to single 
relations that possibly violate a given set of functional dependencies (FDs). 

In defining consistent answers to aggregation queries we distinguish between 
queries with scalar and aggregation functions. The former return a single value 
for the entire relation. The latter perform grouping on an attribute (or a set of 
attributes) and return a single value for each group. Both kinds of queries use 
the same standard set of SQL-2 aggregate operators: MIN, MAX, COUNT, SUM, and 
AVG. In this paper, we address only aggregation queries with scalar functions. 

Example 1. Assume we have the following database instance Salary (we are 
identifying the table with the database instance) 



Salary   Name       Amount 
         V. Smith   5000 
         V. Smith   8000 
         P. Jones   3000 
         M. Stone   7000 



and F is $Name \to Amount$, meaning that Name functionally determines 
Amount. This FD is violated by the table Salary, specifically by the two tuples 
with the value V. Smith in attribute Name. If we pose the query MIN(Amount) to this 
database, we should get, independently of how the violation is fixed, the value 
3000. Nevertheless, if we ask MAX(Amount), we have a problem, because the 
maximum, 8000, comes from a tuple that participates in the violation of the 
functional dependency. 

In [2] we defined an answer to a query posed to an inconsistent database as 
consistent when that same answer is obtained from every possible repair of the 
given database instance. Here, a repair is a new database instance that satisfies 
the given integrity constraints (ICs) and departs in a minimal way from the 
original database (see Section 2.1). In our case, the possible repairs are 



Salary_1   Name       Amount        Salary_2   Name       Amount 
           V. Smith   5000                     V. Smith   8000 
           P. Jones   3000                     P. Jones   3000 
           M. Stone   7000                     M. Stone   7000 



In each repair MIN(Amount) returns the same value: 3000. On the other hand, 
MAX(Amount) returns a different value in each repair: 7000 or 8000. Thus, in the 
second case, there is no single consistent answer in the sense defined above. 
Nevertheless, an answer given by the initial database in the form of the interval 
[6000, 9000], meaning that in every repair the maximum lies between 6000 
and 9000, could be considered a consistent answer. In particular, we might be 
interested in getting, as a more accurate consistent answer, the smallest possible 
interval (the optimal lower and upper bounds), in this case the interval 
[7000, 8000]. 
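
The intervals of Example 1 can be checked mechanically. The following Python sketch is an illustration only, not the evaluation method developed in this paper: it enumerates the repairs by brute force as the maximal subsets of the table that satisfy Name -> Amount (for FDs, repairs arise by deleting tuples) and reports the smallest and largest aggregate value over them. The function names are hypothetical, and the enumeration is exponential in the number of tuples.

```python
from itertools import combinations

# Salary instance from Example 1 with the FD Name -> Amount.
salary = [("V. Smith", 5000), ("V. Smith", 8000),
          ("P. Jones", 3000), ("M. Stone", 7000)]


def consistent(tuples):
    """True if no two tuples agree on Name but differ on Amount."""
    return all(not (s[0] == t[0] and s[1] != t[1])
               for s, t in combinations(tuples, 2))


def repairs(instance):
    """All maximal consistent subsets (brute force, for tiny instances only)."""
    candidates = [frozenset(c)
                  for k in range(len(instance) + 1)
                  for c in combinations(instance, k)
                  if consistent(c)]
    return [c for c in candidates if not any(c < d for d in candidates)]


def optimal_interval(instance, aggregate):
    """Smallest interval containing the aggregate value of every repair."""
    values = [aggregate(amount for _, amount in rep)
              for rep in repairs(instance)]
    return min(values), max(values)


print(optimal_interval(salary, min))  # (3000, 3000)
print(optimal_interval(salary, max))  # (7000, 8000)
```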



Example 2. Consider the FD $StNumber \to Name$ and the inconsistent database 
instance 

Jobs   StNumber   Name       Activity 
       980134     D. Singh   TeachAsst 
       980134     F. Chen    ResAsst 
       980134     D. Singh   Programmer 



This instance has two possible repairs 

Jobs_1   StNumber   Name       Activity 
         980134     D. Singh   TeachAsst 
         980134     D. Singh   Programmer 

Jobs_2   StNumber   Name       Activity 
         980134     F. Chen    ResAsst 



If we pose the query COUNT(Jobs) to these repairs, we obtain two different 
answers, 2 and 1, respectively. Thus, the optimal consistent answer is the interval 
[1, 2]. □ 



Therefore, for aggregation queries we have to slightly weaken the notion of 
consistent query answer to allow answers that are not single values, but intervals. 

In Section 2, we provide a general definition of consistent answer to an ag- 
gregation query with scalar functions. We also define a graph-theoretical repre- 
sentation of the database repairs, which is specifically geared towards FDs. In 
Section 3, we study the data complexity of the problem of computing consistent 
answers to aggregation queries in inconsistent databases. In Section 4, we show 
how to reduce the computational cost of computing such answers by decompo- 
sing the computation into two parts: one that involves standard relational query 
evaluation and one that computes the consistent answers in a smaller instance. 
In Section 5, we show that the complexity of computing consistent answers can 
be reduced by exploiting special properties of the given set of FDs or the given 
instances. In Section 6 we discuss related and further work. 



2 Basic Notions 

In this paper we assume that we have a fixed database schema containing only 
one relation schema R with the set of attributes U. We will denote elements of 
U by A, B, ..., subsets of U by X, Y, ..., and the union of X and Y by XY. 

We also have two fixed, disjoint infinite database domains: D (uninterpreted 
constants) and N (numbers). We assume that elements of the domains with 
different names are different. The database instances can be seen as first order 
structures that share the domains D and N. Every attribute in U is typed, thus 
all the instances of R can contain only elements either of D or of N in a single 
attribute. Since each instance is finite, it has a finite active domain which is a 
subset of $D \cup N$. As usual, we allow built-in predicates over N that have infinite 
extensions, identical for all database instances. There is also a set of functional 
dependencies F over R that captures the semantics of the database. E.g., it may 
express the property that an employee has only a single salary. The instances 
of the database do not have to satisfy F (because the database may contain 
integrated data from multiple sources). A database that violates a given set of 
FDs is called FD-inconsistent. 



2.1 Repairs 

Given a database instance $r$, we denote by $\Sigma(r)$ the set of formulas 
$\{P(\bar a) \mid r \models P(\bar a)\}$, where $P$ is a relation name and $\bar a$ a ground tuple. 

Definition 1. The distance $\Delta(r, r')$ between database instances $r$ and $r'$ is the 
symmetric difference: $\Delta(r, r') = (\Sigma(r) - \Sigma(r')) \cup (\Sigma(r') - \Sigma(r))$. 

Definition 2. For the instances $r, r', r''$, $r' \le_r r''$ if $\Delta(r, r') \subseteq \Delta(r, r'')$, 
i.e., if the distance between $r$ and $r'$ is less than or equal to the distance between 
$r$ and $r''$. 

Definition 3. Given a set of FDs $F$ and database instances $r$ and $r'$, we say 
that $r'$ is a repair of $r$ w.r.t. $F$ if $r' \models F$ and $r'$ is $\le_r$-minimal in the class of 
database instances that satisfy the set of FDs $F$. □ 

We denote by $Repairs_F(r)$ the set of repairs of $r$ w.r.t. $F$. Examples 1 and 2 
illustrate the notion of repair. For a set of FDs, F, repairs are always obtained 
by deleting tuples from the table. For every instance r, the union of all repairs 
of r w.r.t. F is equal to r. These properties are not necessarily shared by other 
classes of ICs. 

Definition 4. The core of $r$ is defined as $Core_F(r) = \bigcap_{r' \in Repairs_F(r)} r'$. 

The core is a new database instance. If $r$ consists of a single relation, then 
the core is the intersection of all the repairs of $r$. The core of $r$ itself is not 
necessarily a repair of $r$. In Example 1, the core is the table containing the tuples 
(P. Jones, 3000) and (M. Stone, 7000) only. In Example 2, the core is empty. 

2.2 Consistent Query Answers 

First Order Queries. Query answers for first order queries are defined in the 
standard way. 

Definition 5. Given a set of integrity constraints $F$, we say that a (ground) 
tuple $\bar t$ is a consistent answer to a query $Q(\bar x)$ in a database instance $r$, and we 
write $r \models_F Q(\bar t)$ (or $r \models_F Q(\bar x)[\bar t]$), if for every $r' \in Repairs_F(r)$, 
$r' \models Q(\bar t)$. If $Q$ is a sentence, then true (false) is a consistent answer to $Q$ in $r$, 
and we write $r \models_F Q$ ($r \models_F \neg Q$), if for every $r' \in Repairs_F(r)$, 
$r' \models Q$ ($r' \not\models Q$). 

Aggregation Queries. The aggregation queries we consider are queries of the 
form: SELECT f(...) FROM R, where f is one of the aggregate operators MIN, 
MAX, COUNT, SUM, and AVG, applied to an attribute or the entire relation (as 
with COUNT(*)). These queries return single numerical values by applying 
the corresponding scalar function, i.e., minimum for MIN, etc. In general, $f$ will 
denote an aggregation query (or a scalar function itself). We write $r \models f = a$ to 
express that the aggregation query $f$ returns the value $a$ in the instance $r$. 

Definition 6. Given a set of integrity constraints $F$, we say that a numerical 
interval $[a, b]$, with $-\infty < a \le b < \infty$, is a consistent answer to an aggregation 
query $f$ in a database instance $r$, and we write $r \models_F f \in [a, b]$ (or 
$r \models_F a \le f \le b$) if for every $r' \in Repairs_F(r)$, $r'$ returns to the query $f$ a value 
$v$ such that $a \le v \le b$. If $[a, b]$ is a consistent answer, then $a$ is called a 
lower-bound-answer and $b$ an upper-bound-answer. An interval is an optimal consistent 
answer if no subinterval is a consistent answer. If $[a, b]$ is an optimal consistent answer, 
then $a$ is called the greatest-lower-bound-answer (glb-answer) and denoted $glb_F(f, r)$, 
and $b$ the least-upper-bound-answer (lub-answer) and denoted $lub_F(f, r)$. □ 

We will be particularly interested in obtaining optimal consistent answers by 
querying the possibly inconsistent database, without computing and checking all 
possible repairs. 

Note: Our notion of consistent query answer for aggregation queries with 
scalar functions has some shortcomings. For instance, while we guarantee that 
the value of the scalar function in every repair falls within the returned interval, 
clearly not every value in this interval will correspond to the value of the function 
obtained in some repair. Perhaps it is more natural for such queries to return 
a set of values, each corresponding to the value of the function in some repair. 
Along the same lines, one could represent such a set as an OR-object [12] or a 
C-table [11]. However, the interval-based representation is exponentially more 
compact than any explicit set-based representation. 

Example 3. Consider the functional dependency $A \to B$ and the following database 
instance $r_0$ (columns represent tuples): 

r_0   A   1   1   2   2   ...   n   n 
      B   0   1   0   2   ...   0   $2^{n-1}$ 

The scalar function involving summing on the B attribute will assume each value 
between 0 and $2^n - 1$ in some repair of $r_0$. Therefore, any set-based representation 
of the set of all of those values will be of exponential size. On the other hand, the 
interval-based representation $[0, 2^n - 1]$ has polynomial size. □ 
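
For small n the claim of Example 3 can be verified directly. The sketch below (hypothetical helper name; it assumes only the shape of the instance given in the example, with tuples (i, 0) and (i, 2^(i-1)) for each A-value i) collects the value of SUM(B) in every repair.

```python
from itertools import product


def repair_sums(n):
    """Set of SUM(B) values over all repairs of the instance of Example 3."""
    # For every A-value i the tuples (i, 0) and (i, 2**(i-1)) conflict,
    # so a repair keeps exactly one of the two.
    choices = [(0, 2 ** (i - 1)) for i in range(1, n + 1)]
    return {sum(pick) for pick in product(*choices)}


sums = repair_sums(4)
print(len(sums))                # 16 distinct values
print((min(sums), max(sums)))   # (0, 15)
```

The explicit set has 2^n elements, whereas the interval is described by its two endpoints.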

In addition to consistent answers, we will also consider other auxiliary notions 
of query answers in inconsistent databases. 

Definition 7. A value $v$ is a core answer w.r.t. $F$ to $f$ in $r$ if 

$v = f\bigl(\bigcap_{r' \in Repairs_F(r)} r'\bigr)$. 

A value $v$ is a union answer w.r.t. $F$ to $f$ in $r$ if 

$v = f\bigl(\bigcup_{r' \in Repairs_F(r)} r'\bigr)$. 

Union answers are trivial for FDs, as the union of all the repairs of $r$ is $r$ 
itself, so the union answer reduces to $f(r)$. 

2.3 Graph Representation 

Given a set of FDs F and an instance r, all the repairs of r w.r.t. F can be 
succinctly represented as a graph. 

Definition 8. The conflict graph $G_{F,r}$ is an undirected graph whose set of vertices 
is the set of tuples in $r$ and whose set of edges consists of all the edges $(t_1, t_2)$ 
such that there is a dependency $X \to Y \in F$ for which $t_1[X] = t_2[X]$ 
and $t_1[Y] \ne t_2[Y]$. The complement conflict graph $\bar{G}_{F,r}$ is the complement of 
the conflict graph. 

Example 4. Consider a schema $R(AB)$, the set $F$ of two functional dependencies 
$A \to B$ and $B \to A$, and an instance $r = \{(a_1, b_1), (a_1, b_2), (a_2, b_2), (a_2, b_1)\}$ over 
this schema. The conflict graph $G_{F,r}$ looks as follows: 

(a1, b1) ------ (a1, b2) 
    |               | 
(a2, b1) ------ (a2, b2) 

Proposition 1. Each repair in $Repairs_F(r)$ corresponds to a maximal independent 
set in $G_{F,r}$ (or a maximal clique in $\bar{G}_{F,r}$) and vice versa. □ 

The above graphs are geared specifically towards FDs. The repairs of other 
classes of constraints do not necessarily have similar representations. 
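
Definition 8 and Proposition 1 can be made concrete on Example 4. The following sketch uses a hypothetical encoding (attribute positions instead of attribute names, each FD given as a pair of position tuples), builds the conflict graph, and lists its maximal independent sets by brute force.

```python
from itertools import combinations

# Example 4: schema R(A, B) with FDs A -> B and B -> A (positions 0 and 1).
r = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a2", "b1")]
fds = [((0,), (1,)), ((1,), (0,))]   # each FD as (X positions, Y positions)


def proj(t, positions):
    return tuple(t[i] for i in positions)


def in_conflict(s, t, fds):
    """Edge test of Definition 8: equal on X, different on Y, for some FD."""
    return any(proj(s, x) == proj(t, x) and proj(s, y) != proj(t, y)
               for x, y in fds)


def conflict_graph(instance, fds):
    return {frozenset((s, t)) for s, t in combinations(instance, 2)
            if in_conflict(s, t, fds)}


def maximal_independent_sets(instance, edges):
    independent = [frozenset(c)
                   for k in range(len(instance) + 1)
                   for c in combinations(instance, k)
                   if not any(frozenset(p) in edges
                              for p in combinations(c, 2))]
    return [s for s in independent if not any(s < t for t in independent)]


edges = conflict_graph(r, fds)
print(len(edges))   # 4 edges: the chordless 4-cycle of Example 4
for repair in maximal_independent_sets(r, edges):
    print(sorted(repair))
```

On this instance the two maximal independent sets are the two diagonals of the cycle, which are exactly the two repairs.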

2.4 Computational Complexity 

Data Complexity. The data complexity assumption [7,15] makes it possible to 
study the complexity of query processing as a function of the size of the database 
instance. 

Definition 9. Given a class of databases $\mathcal{D}$, a class of queries $\mathcal{L}$ and a class of 
integrity constraints, the data complexity of computing consistent query answers 
is defined to be the complexity of (deciding the membership of) the sets 
$D_{F,\phi} = \{(D, \bar t) : D \models_F \phi[\bar t]\}$ for a fixed $\phi \in \mathcal{L}$ and a fixed finite set 
$F$ of integrity constraints. This problem is $C$-data-hard for a complexity class $C$ if 
there is a query $\phi \in \mathcal{L}$ and a finite set of integrity constraints $F$ such that 
$D_{F,\phi}$ is $C$-hard. 

Upper and Lower Complexity Bounds. We view computing glb- and lub- 
answers as an optimization problem. It is easy to see that for all SQL scalar 
aggregation queries the data complexity of this problem is in NPO - the class of 
optimization problems whose associated decision problems are in NP [4]. In several 
cases, we will show that computing glb- and lub-answers is in PO (polynomial- 
time computable optimization problems). To show intractability of computing 
a glb- (or lub-)answer to $f(r)$ for an aggregation query $f$, we will demonstrate 
that the decision problem $glb_F(f, r)\ \theta\ k$ (or $lub_F(f, r)\ \theta\ k$), where 
$\theta \in \{\le, \ge\}$, is NP-hard. If the latter is the case, then clearly computing the 
appropriate consistent answer is not in PO, unless P=NP. 

3 Scalar Aggregation 

Computing consistent answers by producing all the repairs of a database instance 
and then computing the aggregation queries for each of them may have a high 
complexity. The following instance $r_1$ with $2n$ tuples (columns represent tuples): 

r_1   A   1   1   2   2   ...   n   n 
      B   0   1   0   1   ...   0   1 

has $2^n$ possible repairs for the single FD $A \to B$. So, in general, computing all 
repairs and then evaluating a query in each repair is not feasible. We have identified 
two ways of computing consistent answers by querying the given, inconsistent 
database instance, without having to compute all the repairs. Query transformation 
modifies the original query, $Q$, into a new query, $T(Q)$, that returns only 
consistent answers. We have applied this approach in [2] to restricted first order 
queries and universal integrity constraints. Except in some simple cases, this 
approach does not seem applicable to aggregation queries. For example, even when 
MAX(A) and MIN(A) queries can be written as first order queries, their resulting 
syntax does not allow the methodology developed in [2] to be applied to them. Here, 
we use instead the fact that for FDs, the set of all repairs of an instance can 
be compactly represented as the conflict graph or its complement. We develop 
techniques and algorithms geared specifically towards this representation. 

3.1 Core Answers 

We start by considering core answers. For some aggregation operators, e.g., 
COUNT and SUM of nonnegative values, a core answer is a lower-bound-answer, 
but not necessarily a glb-answer. As we will see in Section 4, computing core 
answers to aggregation queries can be useful for computing consistent answers. 

Theorem 1. The data complexity of computing core answers for any scalar 
function is in PTIME. 

Proof: The core consists of all the isolated vertices in the conflict graph. □ 
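
As an illustration of the proof, the isolated vertices can be selected with a single NOT EXISTS query. The sketch below runs such a query through Python's standard sqlite3 module on the Salary table of Example 1; the table layout and the function name are assumptions made for this example.

```python
import sqlite3


def core_of_salary(rows):
    """Tuples of Salary(Name, Amount) that conflict with no other tuple
    under the FD Name -> Amount, i.e. the isolated vertices."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Salary (Name TEXT, Amount INTEGER)")
    conn.executemany("INSERT INTO Salary VALUES (?, ?)", rows)
    query = """
        SELECT Name, Amount FROM Salary AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM Salary AS t
            WHERE t.Name = s.Name AND t.Amount <> s.Amount)
        ORDER BY Name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


salary = [("V. Smith", 5000), ("V. Smith", 8000),
          ("P. Jones", 3000), ("M. Stone", 7000)]
print(core_of_salary(salary))  # [('M. Stone', 7000), ('P. Jones', 3000)]
```

Any scalar function can then be evaluated on the result to obtain the core answer.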

In general, computing glb-answers and lub-answers is considerably more involved 
than computing core answers. We consider each aggregation operator in 
turn. In the following, $r$ denotes an instance of a schema $R$. 

3.2 MIN and MAX 

Consider MAX(A) (MIN(A) is symmetric). In this case computing the lub-answer 
in $r$ w.r.t. an arbitrary set of FDs $F$ consists of evaluating MAX(A) in $r$. However, 
it is not obvious how to compute the glb-answer, namely the minimum of the set 
of maximums obtained by posing the query MAX(A) in every repair. Computing 
MAX(A) in $Core_F(r)$ gives us only a lower-bound-answer which does not have to 
be the glb-answer. We first provide a definition and prove a lemma which will 
also be useful later. Recall that $U$ is the set of all attributes of the schema $R$. 

Definition 10. An FD $X \to Y$ is a partition dependency over $R$ if $X \cup Y = U$ 
and $X \cap Y = \emptyset$. 

Lemma 1. For any instance $r$ of $R$ and any partition dependency $d = X \to Y$ 
over $R$, the conflict graph $G_{d,r}$ is a union of disjoint cliques. 

Proof: Assume $(t_1, t_2)$ and $(t_2, t_3)$ are two edges in $G_{d,r}$ such that $t_1 \ne t_3$. 
Then $t_1[X] = t_2[X]$, $t_1[Y] \ne t_2[Y]$, $t_2[X] = t_3[X]$, and $t_2[Y] \ne t_3[Y]$. Therefore 
$t_1[X] = t_3[X]$. Also, $t_1[Y] \ne t_3[Y]$ because otherwise $t_1$ and $t_3$ would be the 
same tuple. So $(t_1, t_3)$ is an edge in $G_{d,r}$. □ 

Theorem 2. The data complexity of computing $glb_F(\mathrm{MAX}(A), r)$ in $r$ for a set of 
FDs $F$ consisting of a single FD $X \to Y$ is in PTIME. 

Proof: Consider first the case where the FD is a partition dependency. Then 
by Lemma 1 the conflict graph $G_{F,r}$ is a union of disjoint cliques $C_1, \ldots, C_k$. 
Every repair picks exactly one tuple from each clique. Consider a tuple $t$ in a 
clique $C_j$, $1 \le j \le k$. The value $t[A]$ is a maximum in a repair iff for every clique 
$C_i$, $1 \le i \le k$, there is a tuple $t'$ in $C_i$ such that $t'[A] \le t[A]$. This condition 
can be tested in PTIME because the cliques are in our case just the connected 
components. Denote the set of all maximum values determined in this way as $S$. 
Then the glb-answer to MAX(A) is the minimum value in $S$. 

A slight complication arises if the FD is not a partition dependency. The 
schema may contain some attributes other than those in $XY$. Let us call two tuples 
$t$ and $t'$ $XY$-overlapping if $t[XY] = t'[XY]$. There may be two different 
$XY$-overlapping tuples which are not in conflict although they are both in conflict 
with some other tuple. Thus, the conflict graph is not necessarily a union of 
disjoint cliques. However, it is easy to see that $XY$-overlapping tuples are always 
together in a repair. Therefore only the tuples with the maximum value of $A$ 
among all $XY$-overlapping tuples can have a maximum value in a repair. All the 
remaining tuples can be removed without affecting the set of maximum values 
in repairs. If there is more than one tuple with the maximum value, an arbitrary 
one is selected. Denote the instance obtained in this way as $r'$. The conflict graph 
$G_{F,r'}$ is a union of disjoint cliques and the procedure described in the previous 
paragraph can be applied. □ 
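
The two steps of the proof can be sketched in Python as follows. This is an illustrative reading of the argument, not code from the paper: attributes are addressed by position, the instance is assumed to be non-empty, and the function name is hypothetical.

```python
from collections import defaultdict


def glb_max(instance, x, y, a):
    """glb-answer to MAX over attribute position a, for the single FD x -> y
    (x and y are tuples of attribute positions)."""
    def proj(t, positions):
        return tuple(t[i] for i in positions)

    # XY-overlapping tuples always occur together in a repair, so each group
    # of them is represented by its largest value of the aggregated attribute.
    group_max = {}
    for t in instance:
        key = (proj(t, x), proj(t, y))
        group_max[key] = max(group_max.get(key, t[a]), t[a])

    # Groups that share the X-value are pairwise in conflict: one clique each.
    cliques = defaultdict(list)
    for (x_value, _), value in group_max.items():
        cliques[x_value].append(value)

    # v is the maximum of some repair iff every clique offers a value <= v.
    s = [v for values in cliques.values() for v in values
         if all(min(other) <= v for other in cliques.values())]
    return min(s)


salary = [("V. Smith", 5000), ("V. Smith", 8000),
          ("P. Jones", 3000), ("M. Stone", 7000)]
print(glb_max(salary, x=(0,), y=(1,), a=1))  # 7000

# A non-partition case: schema (X, Y, A) with the FD X -> Y.
r = [(1, "a", 10), (1, "a", 3), (1, "b", 5), (2, "c", 4)]
print(glb_max(r, x=(0,), y=(1,), a=2))       # 5
```

The first call reproduces the lower endpoint of the interval [7000, 8000] of Example 1.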

Theorem 3. There is a set of 2 FDs $F_0$ for which deciding whether 
$glb_{F_0}(\mathrm{MAX}(A), r) \le k$ in $r$ is NP-data-hard. 

Proof: We reduce SAT to our problem. Consider a propositional formula 
$\varphi = C_1 \wedge \cdots \wedge C_n$ in CNF. Let $p_1, \ldots, p_m$ be the propositional variables 
in $\varphi$. Construct a relation $r$ with the list of attributes $A, B, C, D$ and containing 
exactly the following tuples: 

1. $(p_i, 1, C_j, 1)$ if making $p_i$ true makes $C_j$ true, 

2. $(p_i, 0, C_j, 1)$ if making $p_i$ false makes $C_j$ true, 

3. $(w, w, C_j, 2)$, $1 \le j \le n$, where $w$ is a new symbol. 

Consider also the FDs $A \to B$ (each propositional variable cannot have more 
than one truth value) and $C \to D$. The crucial observation is that 
$glb_{F_0}(\mathrm{MAX}(D), r) = 1$ iff $\varphi$ is satisfiable. □ 
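
The reduction can be tried out on tiny formulas. The sketch below builds the relation described in the proof and computes the glb-answer to MAX(D) by brute force over all repairs. The encoding of clauses as lists of (variable, polarity) pairs and the function names are assumptions for this illustration; the string "w" is reserved for the new symbol, and the enumeration is exponential.

```python
from itertools import combinations


def build_instance(clauses):
    """clauses: list of clauses, each a list of (variable, polarity) literals
    with polarity 1 (positive) or 0 (negated)."""
    r = set()
    for j, clause in enumerate(clauses, start=1):
        for variable, polarity in clause:
            r.add((variable, polarity, f"C{j}", 1))   # tuples of kinds 1 and 2
        r.add(("w", "w", f"C{j}", 2))                 # tuples of kind 3
    return sorted(r, key=repr)


def in_conflict(s, t):
    """FDs A -> B and C -> D on tuples (A, B, C, D)."""
    return (s[0] == t[0] and s[1] != t[1]) or (s[2] == t[2] and s[3] != t[3])


def glb_max_d(r):
    """Minimum of MAX(D) over all repairs (maximal consistent subsets)."""
    consistent = [frozenset(c)
                  for k in range(len(r) + 1)
                  for c in combinations(r, k)
                  if not any(in_conflict(s, t) for s, t in combinations(c, 2))]
    repairs = [c for c in consistent if not any(c < d for d in consistent)]
    return min(max(t[3] for t in rep) for rep in repairs)


satisfiable = [[("p1", 1), ("p2", 1)], [("p1", 0)]]   # (p1 or p2) and (not p1)
unsatisfiable = [[("p1", 1)], [("p1", 0)]]            # p1 and (not p1)
print(glb_max_d(build_instance(satisfiable)))     # 1
print(glb_max_d(build_instance(unsatisfiable)))   # 2
```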
3.3 COUNT(*) and SUM 

We consider only COUNT(*): SUM is very similar. 

Theorem 4. If the set of FDs $F$ is equivalent to a single dependency $X \to Y$, 
$X \cap Y = \emptyset$, the data complexity of computing $glb_F(\mathrm{COUNT}(*), r)$ (or 
$lub_F(\mathrm{COUNT}(*), r)$) in $r$ is in PTIME. 

Proof: The glb-answer can be computed using the following set of SQL views 
(the lub-answer is obtained in a similar way): 

CREATE VIEW S(X,Y,C) AS 
SELECT X, Y, COUNT(*) FROM R 
GROUP BY X, Y; 

CREATE VIEW T(X,C) AS 
SELECT X, MIN(C) FROM S 
GROUP BY X; 

SELECT SUM(C) FROM T; □ 
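
The views can be exercised with Python's standard sqlite3 module. In the sketch below the table R(X, Y, Z), its sample rows, and the function name are assumptions made for the illustration; Z stands for attributes outside XY, so that groups of XY-overlapping tuples can have different sizes. The lub-answer is computed with MAX in place of MIN, which is our reading of "in a similar way".

```python
import sqlite3


def count_bounds(rows):
    """Return (glb, lub) of COUNT(*) under the single FD X -> Y."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R (X TEXT, Y INTEGER, Z TEXT);
        -- Size of every group of XY-overlapping tuples.
        CREATE VIEW S AS
            SELECT X, Y, COUNT(*) AS C FROM R GROUP BY X, Y;
    """)
    conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)
    # A repair keeps exactly one group per X-value: the smallest group gives
    # the glb-answer, the largest the lub-answer.
    glb = conn.execute(
        "SELECT SUM(C) FROM (SELECT X, MIN(C) AS C FROM S GROUP BY X)"
    ).fetchone()[0]
    lub = conn.execute(
        "SELECT SUM(C) FROM (SELECT X, MAX(C) AS C FROM S GROUP BY X)"
    ).fetchone()[0]
    conn.close()
    return glb, lub


rows = [("Smith", 5000, "a"), ("Smith", 8000, "b"), ("Smith", 8000, "c"),
        ("Jones", 3000, "d"), ("Stone", 7000, "e")]
print(count_bounds(rows))  # (3, 4)
```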

To characterize the remaining cases, we prove two lemmas about maximum 
cliques in conflict graphs. 

Lemma 2. There is a set of 2 FDs $F_1$ for which the problem of determining the 
existence of a repair of $r$ of size $\ge k$ is NP-data-hard. 

Proof: Reduction from 3-COLORABILITY. Given a graph $G = (N, E)$, with 
$N = \{1, 2, \ldots, n\}$, and given the colors $w$ (white), $b$ (blue) and $r$ (red), we define 
the relation $P(A, B, C, D)$ by means of the following rules: 

1. for every $1 \le i \le n$, $(i, w, i, w) \in P$, $(i, b, i, b) \in P$ and $(i, r, i, r) \in P$. 

2. for every $(i, j) \in E$, $(i, w, j, b) \in P$, $(i, w, j, r) \in P$, $(i, b, j, w) \in P$, 
$(i, b, j, r) \in P$, $(i, r, j, w) \in P$ and $(i, r, j, b) \in P$. 

We consider the set of functional dependencies $A \to B$ and $C \to D$. The crucial 
property is that $G$ is 3-colorable iff there is a repair $P'$ of $P$ with exactly 
$n + 2 \cdot |E|$ tuples (the maximum possible number of tuples in a repair). □ 

Lemma 3. There is a set of 2 FDs $F_2$ for which the problem of determining the 
existence of a repair of $r$ of size $\le k$ is NP-data-hard. 

Proof: Modification of the lower bound proof of Theorem 3. We build the 
instance by using the same tuples of the kind (1) and (2), as well as sufficiently 
many tuples of the kind (3), each with a different new symbol $w$. It is enough 
to have $3n(n + 1)$ such tuples, where $n$ is the number of clauses. The formula is 
satisfiable iff there is a repair of size $\le 3n$. □ 

Lemmas 2 and 3 imply the following theorems. 

Theorem 5. There is a set of two FDs $F_1$ for which determining whether 
$lub_{F_1}(\mathrm{COUNT}(*), r) \ge k$ in $r$ is NP-data-hard. 

Theorem 6. There is a set of two FDs $F_2$ for which determining whether 
$glb_{F_2}(\mathrm{COUNT}(*), r) \le k$ in $r$ is NP-data-hard. □ 

The above results establish the intractability of determining lub-answers and 
glb-answers to COUNT(*) in a general setting. Similar results hold for SUM. We 
will see that the boundary between the tractable and the intractable can be 
pushed farther in several special cases. 

3.4 COUNT(A) 

We assume here that distinct values of A are counted (COUNT(DISTINCT A)). 

Theorem 7. There is a single FD $d_0 = B \to A$ for which determining whether 
$glb_{d_0}(\mathrm{COUNT}(A), r) \le k$ in $r$ is NP-data-hard. 

Proof: To see that the lower bound holds, we will encode an instance of the 
HITTING SET problem in $r$. For every set $S_i$ in the given collection $C$ and 
every element $x \in S_i$ we put the tuple $(i, x)$ in $r$. There is in $C$ a hitting set 
of size less than or equal to $k$ iff there is a repair of $r$ with at most $k$ different 
values of the first attribute $A$. □ 

Theorem 8. There is a single FD $d_1 = B \to A$ for which determining whether 
$lub_{d_1}(\mathrm{COUNT}(A), r) \ge k$ in $r$ is NP-data-hard. 

Proof: We reduce SAT to this problem. Let the SAT instance be the conjunction 
of clauses $\varphi = C_1 \wedge \ldots \wedge C_n$. Consider the functional dependency $X \to Y$ and 
the database instance $r(X, Y, A)$ with the following tuples: 

1. $(p_i, 1, C_j)$ if making $p_i$ true makes $C_j$ true. 

2. $(p_i, 0, C_j)$ if making $p_i$ false makes $C_j$ true. 

Then, $\varphi$ is satisfiable iff $lub_{d_1}(\mathrm{COUNT}(A), r) \ge n$. □ 

3.5 AVG 

Theorem 9. If a set of FDs $F$ is equivalent to a single dependency $X \to Y$, with 
$X \cap Y = \emptyset$, then the data complexity of the problem of computing 
$glb_F(\mathrm{AVG}(A), r)$ (or $lub_F(\mathrm{AVG}(A), r)$) in $r$ is in PTIME. 

Proof: ^ First, the problems of finding the glb and lub answers for AVG with 
one functional dependency can be reduced in polynomial time to the following 
problem: 

P1: There are $m$ bins. Each bin contains objects of different colors. No two 
bins have objects of the same color. All objects of the same color have the same 
weight. One has to choose exactly one color for each bin in such a way that the 
sum of the weights of all objects of the chosen colors divided by the total number 
of such objects (i.e., the average weight AVG of objects of the chosen colors) is 
maximized. 



^ The proof of this theorem is due to Vijay Raghavan and Jeremy Spinrad. 

To solve P1, consider the well-known "2-OPT" strategy of starting with an 
arbitrary selection $(c_1, c_2, \ldots, c_m)$ of one color from each of the $m$ bins. The 
2-OPT strategy is simply to replace a color from one bin with a different color 
from the same bin if so doing increases the value of the average weight of objects 
of the colors in the selection. 

This 2-OPT strategy can be shown to converge to the optimum. In addition, 
it can be designed in such a way that it runs in polynomial time. □ 
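
A direct, unoptimized rendering of the 2-OPT strategy for P1 is sketched below. The representation of a bin as a list of (weight, count) pairs with integer entries and positive counts, and the function names, are assumptions of this illustration; the sketch makes no attempt at the polynomial running time mentioned in the proof, and exact fractions are used so that comparisons of averages are not affected by rounding.

```python
from fractions import Fraction
from itertools import product


def average(bins, selection):
    """Average weight of all objects whose color is selected.
    bins[i] lists the colors of bin i as (weight, count) pairs."""
    chosen = [bins[i][c] for i, c in enumerate(selection)]
    total_weight = sum(weight * count for weight, count in chosen)
    total_count = sum(count for _, count in chosen)
    return Fraction(total_weight, total_count)


def two_opt(bins):
    """Local search: swap one color at a time while the average increases."""
    selection = [0] * len(bins)            # arbitrary initial selection
    best = average(bins, selection)
    improved = True
    while improved:
        improved = False
        for i, colors in enumerate(bins):
            for c in range(len(colors)):
                candidate = selection[:i] + [c] + selection[i + 1:]
                value = average(bins, candidate)
                if value > best:           # accept strict improvements only
                    selection, best, improved = candidate, value, True
    return selection, best


def brute_force(bins):
    """Reference optimum by trying every selection (exponential)."""
    return max(average(bins, list(sel))
               for sel in product(*(range(len(b)) for b in bins)))


bins = [[(10, 1), (4, 3)], [(2, 5), (9, 2)], [(7, 1), (6, 4)]]
selection, value = two_opt(bins)
print(selection, value, value == brute_force(bins))
```

Because only strict improvements are accepted and there are finitely many selections, the loop terminates. For the glb-answer the same search would be run with the comparison reversed.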

Theorem 10. There is a set of two FDs $F_3$ for which determining whether 
$glb_{F_3}(\mathrm{AVG}(A), r) \le k$ in $r$ is NP-data-hard. 

Proof: We can use the same reduction as in Theorem 3. There is a satisfying 
assignment iff there is a repair for which AVG(D) = 1 (otherwise the glb-answer 
is bigger than 1) iff $glb_{F_3}(\mathrm{AVG}(D), r) \le 1$. □ 

Theorem 11. There is a set of two FDs $F_4$ for which determining whether 
$lub_{F_4}(\mathrm{AVG}(A), r) \ge k$ in $r$ is NP-data-hard. 

Proof: We reduce SAT to our problem. Change the tuples of the instance in the 
proof of Theorem 3 as follows: 

3'. $(w, w, C_j, -2)$, $1 \le j \le n$, where $w$ is a new symbol. 

There is a satisfying assignment iff $lub_{F_4}(\mathrm{AVG}(D), r) \ge 1$. □ 

3.6 Summary of Complexity Results 

It is easy to show that each of the problems considered before belongs to the class 
NP. 

4 Hybrid Computation 

As we have seen, determining glb-answers and lub-answers is often computationally 
hard. However, it seems that hard instances of those problems are unlikely 
to occur in practice. We expect that in a typical instance a large majority of 
tuples are not involved in any conflicts. If this is the case, it is advantageous to 
break up the computation of an lub-answer to $f$ in $r$ into three parts: (1) the 
computation of $f$ in the core of $r$, (2) the computation of an lub-answer to $f$ 
in the complement of the core of $r$ (which should be small), and (3) the combination 
of the results of (1) and (2). Step (1) can be done using a DBMS 
because the core of $r$ can be computed using a first-order query (Theorem 1). 

Definition 11. The scalar function $f$ admits a $g$-decomposition of its lub- 
answers (resp. glb-answers) w.r.t. a set of FDs $F$ if for every instance $r$ of $R$, the 
lub-answer (resp. glb-answer) $v$ to $f$ satisfies the condition $v = g(f(Core_F(r)), v')$, 
where $v' = lub_F(f, r - Core_F(r))$ (resp. $v' = glb_F(f, r - Core_F(r))$). 

Theorem 12. The following pairs describe $g$-decompositions admitted by scalar 
functions $f$: 

1. $f$ = MIN(A), $g$ = min; 

2. $f$ = MAX(A), $g$ = max; 

3. $f$ = COUNT(*), $g$ = +; 

4. $f$ = SUM(A), $g$ = +. 
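
The three-step computation and two of the pairs of Theorem 12 can be illustrated on the Salary instance of Example 1. In the sketch below the function names are hypothetical, the lub-answer on the complement of the core is found by brute force (reasonable only when that part is small), and both the core and its complement are assumed to be non-empty.

```python
from itertools import combinations
from operator import add


def conflict(s, t):
    """FD Name -> Amount on tuples (Name, Amount)."""
    return s[0] == t[0] and s[1] != t[1]


def core(r):
    """Step (1) input: the tuples involved in no conflict."""
    return [t for t in r if not any(conflict(t, s) for s in r)]


def lub(r, f):
    """Brute-force lub-answer to f over all repairs of r."""
    consistent = [frozenset(c)
                  for k in range(len(r) + 1)
                  for c in combinations(r, k)
                  if not any(conflict(s, t) for s, t in combinations(c, 2))]
    repairs = [c for c in consistent if not any(c < d for d in consistent)]
    return max(f(rep) for rep in repairs)


def hybrid_lub(r, f, g):
    """Steps (1)-(3): f on the core, lub-answer on the rest, combined by g."""
    c = core(r)
    rest = [t for t in r if t not in c]
    return g(f(c), lub(rest, f))


def max_amount(tuples):
    return max(amount for _, amount in tuples)


salary = [("V. Smith", 5000), ("V. Smith", 8000),
          ("P. Jones", 3000), ("M. Stone", 7000)]
print(hybrid_lub(salary, max_amount, max))  # 8000, with g = max
print(hybrid_lub(salary, len, add))         # 3, COUNT(*) with g = +
```

Both values agree with the lub-answers obtained by enumerating the repairs of the whole instance.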

5 Special Cases 

We consider here various cases when the conflict graph (or its complement) has 
some special form that could be used to reduce the complexity of computing 
answers to aggregation queries. 

5.1 BCNF 

We show here that if the set of FDs F has two dependencies and the schema 
R is in BCNF, computing lub-answers can be done in PTIME. This should be 
contrasted with Theorem 5 which showed that two dependencies without the 
BCNF assumption are sufficient for NP-hardness. 

Lemma 4. If $R$ is in BCNF and $F$ is equivalent to a set of FDs with 2 
dependencies, then $F$ is equivalent to a set of FDs with 2 partition dependencies 
$X_1 \to Y_1$ and $X_2 \to Y_2$. □ 

Therefore, WLOG we can assume that $|F| = 2$ and $F = \{d_1, d_2\}$ where $d_1$ 
and $d_2$ are different partition dependencies. (The case of $|F| = 1$ has already 
been shown to be in PTIME, even without the BCNF assumption.) 

Definition 12. A chord in a cycle is an edge connecting two nonconsecutive 
vertices of the cycle. 

Lemma 5. Every cycle of length $k$ where $k$ is odd and $k > 3$ in $G_{\{d_1,d_2\},r}$ has 
a chord. 

Proof: Such a cycle has two consecutive edges $(t_1, t_2)$ and $(t_2, t_3)$ that belong 
both to $G_{d_1,r}$ or both to $G_{d_2,r}$. Therefore, by Lemma 1 the edge $(t_1, t_3)$, which 
is a chord, also belongs to one of those graphs, and consequently to $G_{\{d_1,d_2\},r}$. 
□ 

Note: For the above property to hold, it is essential for the cycle in the 
conflict graph to be odd. Example 4 shows an even cycle of length 4 that does 
not have a chord. That implies that conflict graphs in the case of two FDs are 
not necessarily chordal [5] and thus efficient algorithms for the computation of 
maximum independent set in such graphs [9] are not applicable. 
Lemma 6. Every cycle of length $k$, where $k$ is odd and $k > 3$, in the complement
conflict graph $\bar{G}_{\{d_1,d_2\},r}$ has a chord.

Proof: To give the idea of the proof, we consider the case of $R(A, B)$ with
$d_1 = A \to B$ and $d_2 = B \to A$.

Assume $(t_1, t_2, \ldots, t_k, t_1)$ is a cycle in $\bar{G}_{\{d_1,d_2\},r}$. Let $t_i = (a_i, b_i)$, $1 \le i \le k$,
where the $a_i$'s and $b_i$'s are distinct variables. We write down the formula $\phi$ that
expresses the property that the consecutive vertices in the cycle are adjacent in
$\bar{G}_{\{d_1,d_2\},r}$:

$$\phi = \bigwedge_i a_i \neq a_{i+1} \wedge \bigwedge_i b_i \neq b_{i+1},$$

where the indexes are interpreted cyclically, i.e., $k + 1 = 1$. Now we write down
the formula $\psi$ that expresses the property that there are no chords in the cycle.
This formula is a conjunction of the formulas $\psi_{ij}$ (for every pair $(i,j)$ of
nonconsecutive vertices in the cycle) that express the property that there is a
conflict between $t_i$ and $t_j$:

$$\psi_{ij} = (a_i = a_j \wedge b_i \neq b_j \vee b_i = b_j \wedge a_i \neq a_j) \equiv (a_i = a_j \vee b_i = b_j) \wedge (a_i \neq a_j \vee b_i \neq b_j).$$

Therefore $\psi_{ij}$ postulates at least one equality: $a_i = a_j$ or $b_i = b_j$.

Now the counting argument, writing $n$ for the length $k$ of the cycle. Assume
$\phi \wedge \psi$ is satisfiable. The formula $\phi$ postulates $n$ inequalities between the
$a_i$'s and $n$ inequalities between the $b_i$'s. The formula $\psi$ postulates $\frac{n(n-3)}{2}$
inequalities and the same number of equalities that involve either $a_i$'s or $b_i$'s.
WLOG we assume that at least half of them, i.e., $\frac{n(n-3)}{4}$, involve $a_i$'s.
Therefore, for $n \ge 5$, the equalities imply together yet another equality. (The
assumption that all the equalities holding have disjoint variables leads to
contradiction.) Thus the total number of equalities is $\frac{n(n-3)}{2} + 1$. Now the
total number of postulated equalities and inequalities is

$$2n + \frac{n(n-3)}{2} + \frac{n(n-3)}{2} + 1 = n(n-1) + 1$$

and is greater than the number of 2-element sets consisting only of $a_i$'s or $b_i$'s.
Therefore for some $i$ and $j$, we have both $a_i = a_j$ and $a_i \neq a_j$ (or $b_i = b_j$ and
$b_i \neq b_j$), which contradicts the satisfiability of $\phi \wedge \psi$. Thus an odd cycle of
length $\ge 5$ has to have a chord. □



Definition 13. A graph is perfect if its chromatic number is equal to the size 
of its maximum clique. 

Strong Perfect Graph Conjecture: A graph $G$ is perfect iff every odd cycle
in $G$ or its complement $\bar{G}$ has a chord.

This conjecture has been shown to hold for many classes of graphs, including 
claw-free graphs [5]. 

Definition 14. A graph is claw-free if it does not contain an induced subgraph
$(V_0, E_0)$ where $V_0 = \{t_1, t_2, t_3, t_4\}$ and $E_0 = \{(t_2, t_1), (t_3, t_1), (t_4, t_1)\}$.



Lemma 7. If $R$ is in BCNF over $F = \{d_1, d_2\}$, then for every instance $r$ of $R$,
the conflict graph $G_{\{d_1,d_2\},r}$ is claw-free and perfect.

Proof: Assume that the conflict graph contains a claw $(V_0, E_0)$ where $V_0 =
\{t_1, t_2, t_3, t_4\}$ and $E_0 = \{(t_2, t_1), (t_3, t_1), (t_4, t_1)\}$. Then two of the edges in $E_0$,
say $(t_2, t_1)$ and $(t_3, t_1)$, come from one of $G_{d_1,r}$ or $G_{d_2,r}$. But then, by Lemma 1,







M. Arenas, L. Bertossi, and J. Chomicki 



the edge $(t_2, t_3)$ also belongs to that graph, and consequently to $G_{\{d_1,d_2\},r}$. Thus
the subgraph induced by $V_0$ is not a claw.

As the conflict graph is claw-free, the Strong Perfect Graph Conjecture holds 
for it and Lemmas 5 and 6 yield together the fact that it is perfect. □ 

Theorem 13. If $R$ is in BCNF and the given set of FDs $F$ is equivalent to one
with at most two dependencies, computing $\mathit{lub}_F(\mathrm{COUNT}(*), r)$ in any instance $r$
of $R$ can be done in PTIME.

Proof: The theorem follows from Lemma 7 and the fact that in perfect claw-free
graphs computing a maximum independent set can be done in $O(n^{5.5})$ [10]. □

What about |F| > 2? In this case the conflict graph does not have to be 
claw-free, so it is not clear whether the Strong Perfect Graph Conjecture holds 
for it. The conflict graph does not even have to be perfect. Take a conflict 
graph consisting of a cycle of length 5 where the edges corresponding to the 
dependencies $d_1$, $d_2$ and $d_3$ alternate. The chromatic number of this graph is 3,
while the size of the maximum clique is 2. 
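The two numbers quoted for the 5-cycle can be confirmed by exhaustive search. The following is a minimal sketch; the function names are illustrative and the search is exponential, which is acceptable only because the graph has five vertices.

```python
from itertools import combinations, product

def clique_number(vertices, edges):
    """Size of a largest set of pairwise adjacent vertices (brute force)."""
    adjacent = {frozenset(e) for e in edges}
    best = 0
    for size in range(1, len(vertices) + 1):
        for group in combinations(vertices, size):
            if all(frozenset(pair) in adjacent for pair in combinations(group, 2)):
                best = size
    return best

def chromatic_number(vertices, edges):
    """Smallest number of colors in a proper vertex coloring (brute force)."""
    for colors in range(1, len(vertices) + 1):
        for assignment in product(range(colors), repeat=len(vertices)):
            coloring = dict(zip(vertices, assignment))
            if all(coloring[u] != coloring[v] for u, v in edges):
                return colors
    return 0

# The cycle of length 5: vertex i is adjacent to vertex i + 1 (mod 5).
C5_VERTICES = list(range(5))
C5_EDGES = [(i, (i + 1) % 5) for i in range(5)]
```

Since the chromatic number exceeds the clique number, the 5-cycle fails the condition of Definition 13.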



5.2 Disjoint Union 

Theorem 14. If the instance $r$ is the disjoint union of two instances that separately
satisfy $F$, computing $\mathit{lub}_F(\mathrm{COUNT}(*), r)$ can be done in PTIME.

Proof: In this case, the only conflicts are between the parts of r that come from 
different databases. Thus the conflict graph is a bipartite graph. For bipartite 
graphs determining the maximum independent set can be done in PTIME. □ 
Note that the assumption in Theorem 14 is satisfied when the instance r is 
obtained by merging together two consistent databases in the context of database 
integration. 
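One standard way to obtain the size of a maximum independent set in a bipartite graph is König's theorem: it equals the number of vertices minus the size of a maximum matching. The sketch below uses a simple augmenting-path matching; it is not taken from the paper, and the tuple names and the function `lub_count` are hypothetical.

```python
def max_matching(left, adj):
    """Augmenting-path bipartite matching; adj maps a left vertex to its right neighbours."""
    match_of_right = {}

    def try_augment(u, seen):
        for v in adj.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # v is free, or the vertex currently matched to v can be re-matched elsewhere.
            if v not in match_of_right or try_augment(match_of_right[v], seen):
                match_of_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in left)

def lub_count(left, right, conflicts):
    """Size of a maximum independent set of the bipartite conflict graph."""
    adj = {}
    for u, v in conflicts:
        adj.setdefault(u, []).append(v)
    return len(left) + len(right) - max_matching(left, adj)

# Hypothetical merge of two consistent databases: conflicts only run between the two parts.
DB1 = ["a1", "a2", "a3"]
DB2 = ["b1", "b2"]
CONFLICTS = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
```

Here a maximum matching has two edges, so the largest consistent subset keeps three of the five tuples (for instance, all of `DB1`).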



6 Related and Further Work 

We can only briefly survey the related work here. A more comprehensive dis- 
cussion can be found in [2]. The need to accommodate violations of functional 
dependencies is one of the main motivations for considering disjunctive databa- 
ses [12,14] and has led to various proposals in the context of data integration [1, 
3,8,13]. A purely proof-theoretic notion of consistent query answer comes from 
Bry [6]. None of the above approaches considers aggregation queries. 

Many further questions suggest themselves. First, is it possible to identify 
more tractable cases and to reduce the degree of the polynomial in those already 
identified? Second, is it possible to use approximation in the intractable cases? 
The INDEPENDENT SET problem is notoriously hard to approximate, but 
perhaps the special structure of the conflict graph may be helpful. Finally, it 
would be very interesting to see if our approach can be generalized to broader 
classes of queries and integrity constraints. 




Scalar Aggregation in FD-Inconsistent Databases 






Finally, alternative definitions of repairs and consistent query answers that 
include, for example, preferences are left for future work. Also, one can apply 
further aggregation to the results of aggregation queries in different repairs, e.g., 
the average of all MAX (A) answers. 
Abstract. We establish the equivalence of: (1) the logical implication
problem for a description logic dialect called DLClass that includes a 
concept constructor for expressing uniqueness constraints, (2) the logical 
implication problem for path functional dependencies (PFDs), and (3) 
the problem of answering queries in deductive databases with limited 
use of successor functions. As a consequence, we settle an open problem 
concerning lower bounds for the PFD logical implication problem and 
show that a regularity condition for DLClass that ensures low order 
polynomial time decidability for its logical implication problem is tight. 



1 Introduction 

Description Logics (DLs) have many applications in information systems [2].
They can facilitate data access to heterogeneous data sources because of their
ability to capture integrity constraints manifest in object relational database
schema, in ER diagrams or UML class diagrams and that arise in practical
XML applications, including the constraints that underlie XML document type
definitions (DTDs) [10]. They are particularly valuable for solving problems in
information integration and that arise in query optimization [4,8,16,17]. However,
in many of these applications, it becomes essential for a particular DL dialect
to capture knowledge that relates to various kinds of uniqueness constraints that
are satisfied by the possible data sources.

For example, consider a hypothetical integrated patient management system 
depicted in Figure 1. It can be crucial to know for this system 

1. that a hospital has a unique name, 

2. that a patient is uniquely identified by hospital and patient number, and 

3. that a person admitted to a hospital has a valid unique social security num- 
ber. 

DL dialects that enable capturing keys in the standard database sense and simple 
forms of functional dependencies have been proposed in [4,7]. Such a facility was 
a nagging missing ingredient for prior DL dialects that had efficient subsumption 
checking algorithms. This was achieved by adding a new kind of concept con- 
structor (fd). Also, Calvanese et al. have recently demonstrated how a variety 
of keys with set-valued components can be added to a very general DL dialect 
without increasing the complexity of reasoning [9]. 

* An earlier version of this paper has appeared in the informal proceedings of the 
International Workshop on Description Logics DL2000. 
[Figure: schema diagram; legend entries: Inheritance, Attribute.]

Fig. 1. Patient Data Integration.



In this paper, we consider a more general version of fd in which component 
attribute descriptions may now correspond to attribute or feature paths [15,16, 
17,20]. To focus on the essential idea, we define a very simple DL dialect called 
DLFD that consists of a single concept constructor corresponding to this more 
general version of fd. The logical implication problem for DLFD is therefore a 
special case of the logical implication problem for path functional dependencies 
(PFDs), a variety of uniqueness constraints for data models supporting complex 
objects that was first proposed in [20] and studied more fully in a subsequent 
series of papers [15,19,21]. 

Although DLFD is extremely simple, it can be used to simulate logical im- 
plication problems in a more general object relational dialect via a linear trans- 
lation. DLClass, as we shall call it, includes additional concept constructors that 
directly capture class inheritance and attribute typing, a capability that is es- 
sential to the convenient capture of meta-data relating to possible data sources 
in information systems. 

For example, consider again the hospital schema in Figure 1 for the above-mentioned
patient management system. The schema captures many of the constraints
that underlie an information source that contains XML data with the
structure illustrated in Figure 2. The schema, perhaps given more directly by an
associated DTD for the data, can be captured in DLClass as a terminology that
consists of the following subsumption constraints (cf. Definition 1).

ADMISSION < (and (all Date DATE) (all Patient PATIENT)) 
PATIENT < (all Hospital HOSPITAL) 

PATIENT < (and PERSON (all Number INTEGER)) 

PERSON < (and (all Name STRING) (all SSN INTEGER)) 
HOSPITAL < (all Name STRING) 

VALID-SSN < INTEGER 

Furthermore, DLClass can be used to capture the three uniqueness constraints 
for the patient management system listed at the beginning of this section by 
adding the following additional subsumption constraints to the terminology. 
<ADMISSION>
  <PATIENT Number="123">
    <PERSON SSN="45678">
      <NAME>Fred</NAME>
    </PERSON>
    <HOSPITAL>
      <NAME>Sunny Brook</NAME>
    </HOSPITAL>
  </PATIENT>
  <DATE>20-January-2000</DATE>
</ADMISSION>

Fig. 2. Patient Data in XML. 

HOSPITAL < (fd HOSPITAL : Name → Id)

PATIENT < (fd PATIENT : Hospital, Number → Id)

ADMISSION < (all Patient (and (all SSN VALID-SSN)
                              (fd PERSON : SSN → Id)))

Again, we show that DLFD is sufficiently powerful to simulate all inferences
in DLClass by exhibiting a linear answer preserving translation of DLClass to
DLFD. However, this translation is indirect: it relates the logical implication
problems in DLFD and DLClass to query answering in Datalog$_{nS}$, a deductive
query language with limited use of successor functions [11,12]. This relationship
leads to the main contribution of the paper: an open issue relating to the
complexity of reasoning about PFDs is resolved. By proving an equivalence to
query answering in Datalog$_{nS}$, the logical implication problems for DLFD and
DLClass are DEXPTIME-complete, and therefore the exponential time decision
procedure [15] becomes tight.

The example constraints in DLClass given above have a restricted form that
satisfies a syntactic regularity condition (cf. Section 5), which leads to incremental
polynomial time algorithms for a restricted class of logical implication
problems in DLClass [16]. Using the tight translation between DLClass and
Datalog$_{nS}$ we show that a similar condition can be applied to Datalog$_{nS}$ programs.
This leads to a PTIME query evaluation procedure for a syntactically
restricted class of Datalog$_{nS}$ programs. In addition, the condition turns out to
be as general as one can hope for while ensuring the existence of such efficient
algorithms.

The remainder of the paper is organized as follows. Section 2 defines the
syntax and semantics for DLClass and Datalog$_{nS}$. Section 3 reduces the problem
of answering queries in Datalog$_{nS}$ to the logical implication problem for
DLFD and subsequently to logical implication problems in DLClass. Section 4
completes the picture by reducing the logical implication problem for DLClass
to Datalog$_{nS}$. Section 5 discusses special cases in which PTIME reasoning is
possible. We conclude with a summary in Section 6.




On Decidability and Complexity of Description Logics 






2 Definitions 

The syntax and semantics of DLClass and Datalog$_{nS}$ are given by the following.



Definition 1 (Description Logic DLClass) Let $F$ be a set of attribute names.
We define a path expression by the grammar "Pf ::= f. Pf | Id" for $f \in F$.
Let $C$ be primitive concept description(s). We define derived concept descriptions
using the following grammar:

D ::= C
    | (all f D)
    | (fd C : Pf_1, ..., Pf_k → Pf), k > 0
    | (and D D)

A subsumption constraint is an expression of the form C < D.

The semantics of expressions is given with respect to a structure $(\Delta, \cdot^I)$ where
$\Delta$ is a domain of "objects" and $\cdot^I$ an interpretation function that fixes the
interpretations of primitive concepts to be subsets of the domain, $C^I \subseteq \Delta$, and
primitive attributes to be total functions on the domain, $f^I : \Delta \to \Delta$. This
interpretation is extended to path expressions, $\mathrm{Id}^I = \lambda x.x$ and $(f.\,\mathrm{Pf})^I = \mathrm{Pf}^I \circ f^I$,
and to derived descriptions

$(\text{all } f\ D)^I = \{o \in \Delta : f^I(o) \in D^I\}$

$(\text{fd } C : \mathrm{Pf}_1, \ldots, \mathrm{Pf}_k \to \mathrm{Pf})^I = \{o \in \Delta : \forall o' \in C^I.\ \bigwedge_{i=1}^{k} \mathrm{Pf}_i^I(o) = \mathrm{Pf}_i^I(o') \Rightarrow \mathrm{Pf}^I(o) = \mathrm{Pf}^I(o')\}$

$(\text{and } D_1\ D_2)^I = D_1^I \cap D_2^I$

An interpretation satisfies a subsumption constraint C < D if $C^I \subseteq D^I$.

For a given set of subsumption constraints $\Sigma = \{C_i < D_i \mid 0 < i \le n\}$ (a terminology)
and a subsumption constraint C < D (a posed question), the logical
implication problem asks if $\Sigma \models C < D$ holds, i.e., if all interpretations that
satisfy $\Sigma$ must also satisfy C < D.

Limiting the left-hand-side of subsumption constraints in terminologies to be
primitive concepts is a common assumption to avoid reasoning about equality
between general concept descriptions. In contrast, requiring the left-hand-side
of the posed question to be a primitive concept is not a real limitation since a
more general logical implication problem of the form $\Sigma \models D_1 < D_2$ can always
be rephrased as $\Sigma \cup \{C < D_1\} \models C < D_2$, where $C$ is a primitive concept not
occurring in $\Sigma \cup \{D_1 < D_2\}$.

In the rest of the paper we simplify the notation for path expressions by 
omitting the trailing Id. We also allow a syntactic composition Pfi . Pf2 of path 
expressions that stands for their concatenation. 
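The semantics of Definition 1 can be evaluated directly over a small finite interpretation. The sketch below is an illustration only: the tuple encoding of descriptions, the sample objects, and the function names are assumptions, and a finite domain is used for convenience even though the definition does not require one.

```python
def apply_path(attrs, path, obj):
    """Apply the attributes of a path expression left to right; the empty path is Id."""
    for f in path:
        obj = attrs[f][obj]
    return obj

def extension(desc, domain, concepts, attrs):
    """Set of objects denoted by a concept description.

    Encoding (assumed): a primitive concept is a string; derived descriptions are
    ("all", f, D), ("and", D1, D2) and ("fd", C, [path, ...], path).
    """
    if isinstance(desc, str):
        return set(concepts[desc])
    tag = desc[0]
    if tag == "all":
        _, f, d = desc
        target = extension(d, domain, concepts, attrs)
        return {o for o in domain if attrs[f][o] in target}
    if tag == "and":
        _, d1, d2 = desc
        return extension(d1, domain, concepts, attrs) & extension(d2, domain, concepts, attrs)
    if tag == "fd":
        _, c, lhs, rhs = desc

        def respects(o):
            for other in concepts[c]:
                agree = all(apply_path(attrs, p, o) == apply_path(attrs, p, other) for p in lhs)
                if agree and apply_path(attrs, rhs, o) != apply_path(attrs, rhs, other):
                    return False
            return True

        return {o for o in domain if respects(o)}
    raise ValueError(f"unknown description: {desc!r}")

def satisfies(c, d, domain, concepts, attrs):
    """Does the interpretation satisfy the subsumption constraint c < d?"""
    return set(concepts[c]) <= extension(d, domain, concepts, attrs)

# HOSPITAL < (fd HOSPITAL : Name -> Id), with Id encoded as the empty path.
NAME_KEY = ("fd", "HOSPITAL", [("Name",)], ())

# Two hospitals sharing one name violate the key; attributes are total functions.
DOMAIN_BAD = {"h1", "h2", "n"}
CONCEPTS_BAD = {"HOSPITAL": {"h1", "h2"}, "STRING": {"n"}}
ATTRS_BAD = {"Name": {"h1": "n", "h2": "n", "n": "n"}}

# Distinct names satisfy it.
DOMAIN_OK = {"h1", "h2", "n", "m"}
CONCEPTS_OK = {"HOSPITAL": {"h1", "h2"}, "STRING": {"n", "m"}}
ATTRS_OK = {"Name": {"h1": "n", "h2": "m", "n": "n", "m": "m"}}
```

Checking `satisfies("HOSPITAL", NAME_KEY, ...)` against the two interpretations separates them: the first is a counterexample to the key constraint, the second a model of it.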
Definition 2 (Datalog$_{nS}$ [12]) Let $p_i$ be predicate symbols, $f_i$ function symbols
such that $p_i \neq f_i$, and $X, Y, \ldots$ variables. A logic program $P$ is a finite set
of Horn clauses of the form

$$p_0(t^0, s^0_1, \ldots, s^0_{l_0}) \leftarrow p_1(t^1, s^1_1, \ldots, s^1_{l_1}), \ldots, p_k(t^k, s^k_1, \ldots, s^k_{l_k})$$

for $k \ge 0$, where the terms $t^i$ and $s^i_j$ are constructed from constants, function
symbols and variables. We say that $P$ is a Datalog$_{nS}$ program if

1. $t^i$ is a functional term: a variable, a distinguished constant 0, or a term of
the form $f(t, s_1, \ldots, s_l)$ where $f$ is a function symbol, $t$ is a functional term,
and $s_1, \ldots, s_l$ are data terms,

2. $s^i_j$ are data terms: variables or constants different from 0, and

3. no variable appears both in a functional and a data term.

We say that a Datalog$_{nS}$ program is in normal form if the only predicate and
function symbols used are unary, and whenever a variable appears in any predicate
$p_i$ of a clause then the same variable appears in all the predicates of the
same clause.

A recognition problem for a Datalog$_{nS}$ program $P$ and a ground (variable-free)
atom $q(t, s_1, \ldots, s_k)$ is the question whether $P \models q(t, s_1, \ldots, s_k)$ holds, i.e.,
whether $q(t, s_1, \ldots, s_k)$ is true in all models of $P$.

It is known that every Datalog$_{nS}$ program can be encoded as a normal Datalog$_{nS}$
program [11,12]. Moreover, every normal Datalog$_{nS}$ program can be divided
into a set of clauses with non-empty bodies and a set of ground facts (clauses
with empty bodies). Proofs of theorems in the paper rely on the following two
observations about logic programs [18]:

1. The recognition problem $P \models q$ for a ground $q$ is equivalent to checking
$q \in M_P$ where $M_P$ is the unique least Herbrand model of $P$.

2. $M_P$ can be constructed by iterating an immediate consequence operator $T_P$
associated with $P$ until reaching fixpoint; $M_P = T_P^{\omega}(\emptyset)$.
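For a normal program, the iteration of the immediate consequence operator can be sketched as follows. Because function symbols make the least model infinite in general, this illustration cuts off terms above a chosen depth; it therefore computes only a subset of the least Herbrand model and is not the decision procedure referred to in the text. The encoding (a term is the tuple of its function symbols, innermost first, applied to 0) and all names are assumptions.

```python
def least_model_bounded(facts, clauses, max_depth):
    """Iterate the immediate consequence operator, keeping only terms of depth <= max_depth.

    An atom is (pred, path): in a fact, path is the ground term over 0; in a clause,
    path is the term over the shared variable X. Every clause needs a non-empty body.
    """
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            first_pred, first_path = body[0]
            for pred, term in list(model):
                cut = len(term) - len(first_path)
                if pred != first_pred or cut < 0 or term[cut:] != first_path:
                    continue
                subst = term[:cut]  # the ground term substituted for X
                if all((p, subst + path) in model for p, path in body):
                    derived = (head[0], subst + head[1])
                    if len(derived[1]) <= max_depth and derived not in model:
                        model.add(derived)
                        changed = True
    return model

# Hypothetical normal program:
#   p(0).    p(f(X)) <- p(X).    q(X) <- p(f(f(X))).
FACTS = {("p", ())}
CLAUSES = [
    (("p", ("f",)), [("p", ())]),
    (("q", ()), [("p", ("f", "f"))]),
]
MODEL = least_model_bounded(FACTS, CLAUSES, max_depth=3)
```

With the bound set to 3, the atom q(0) is derived from p(f(f(0))), while p(f(f(f(f(0))))) is cut off even though it belongs to the full least model.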

To establish the complexity bounds we use the following result about Datalog$_{nS}$
programs:



Proposition 3 ([12,14]) The recognition problem for Datalog$_{nS}$ programs (under
the data-complexity¹ measure) is DEXPTIME-complete. The lower bound
holds even for programs in normal form.

¹ Complexity of the problem for a fixed set of symbols.



3 Lower Bounds

In this section we show that the recognition problem for Datalog$_{nS}$ can be reduced
to the (infinite) implication problem for path-functional dependencies. We
study this problem in a DL dialect DLFD in which all subsumption constraints
are of the form

THING < (fd THING : Pf_1, ..., Pf_k → Pf).

THING is a primitive concept interpreted as the domain $\Delta$. In the rest of this
section we use the shorthand Pf_1, ..., Pf_k → Pf for the above constraint.

It is easy to see that DLFD problems can be trivially embedded into DLClass.
We simply consider THING to be a single primitive concept description such that
THING < (all f THING) for every primitive attribute f. Therefore, lower bounds
for DLFD also apply to DLClass.

Notation: we associate two Datalog$_{nS}$ functional terms $\mathrm{Pf}(0) = f_k(\cdots f_1(0) \cdots)$
and $\mathrm{Pf}(X) = f_k(\cdots f_1(X) \cdots)$ with every path expression $\mathrm{Pf} = f_1.\cdots.f_k.\,\mathrm{Id}$,
where 0 is a distinguished constant and $X$ is a variable. Similarly, for every
Datalog$_{nS}$ term $t = f_1(\cdots f_k(X) \cdots)$ there is a path expression $\mathrm{Pf} = f_k.\cdots.f_1.\,\mathrm{Id}$
such that $t = \mathrm{Pf}(X)$. In the rest of the paper we overload the symbols $p_i$ and
$f_i$ to stand both for unary predicate and function symbols in Datalog$_{nS}$ and for
primitive attribute names in the appropriate description logic, and use $\mathrm{Pf}(X)$
and $\mathrm{Pf}(0)$ to stand for Datalog$_{nS}$ terms.

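Theorem 4 Let $P$ be an arbitrary normal Datalog$_{nS}$ program and $G = p(\mathrm{Pf}(0))$
a ground atom. We define

- $\Sigma_P = \{\mathrm{Pf}'_1.p'_1, \ldots, \mathrm{Pf}'_l.p'_l \to \mathrm{Pf}'.p' \;:\; p'(\mathrm{Pf}'(X)) \leftarrow p'_1(\mathrm{Pf}'_1(X)), \ldots, p'_l(\mathrm{Pf}'_l(X)) \in P\}$, and

- $\varphi_{P,G} = \mathrm{Pf}_1.p_1, \ldots, \mathrm{Pf}_k.p_k \to \mathrm{Pf}.p$
where $p_1(\mathrm{Pf}_1(0)), \ldots, p_k(\mathrm{Pf}_k(0))$ are all facts in $P$.

Then $P \models G \iff \Sigma_P \models \varphi_{P,G}$.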
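The translation in Theorem 4 is purely syntactic, which a short sketch can make concrete. The encoding is an assumption: an atom p(Pf(X)) or p(Pf(0)) is written as the pair (p, path), where path lists the attributes of Pf in order, and a path-functional dependency is a pair (left-hand paths, right-hand path).

```python
def to_pfd(head, body):
    """Map a head atom and body atoms to the dependency Pf_1.p_1, ..., Pf_l.p_l -> Pf.p."""
    return ([path + (pred,) for pred, path in body], head[1] + (head[0],))

def sigma_p(clauses):
    """One path-functional dependency per clause with a non-empty body."""
    return [to_pfd(head, body) for head, body in clauses if body]

def phi_pg(facts, goal):
    """The posed dependency: the facts of P on the left, the goal atom on the right."""
    return to_pfd(goal, sorted(facts))

# Hypothetical normal program:
#   p(0).    p(f(X)) <- p(X).    q(X) <- p(f(f(X))).
FACTS = {("p", ())}
CLAUSES = [
    (("p", ("f",)), [("p", ())]),
    (("q", ()), [("p", ("f", "f"))]),
]
GOAL = ("q", ())

SIGMA = sigma_p(CLAUSES)
PHI = phi_pg(FACTS, GOAL)
```

For this program the clause p(f(X)) <- p(X) becomes the dependency p -> f.p, the second clause becomes f.f.p -> q, and the posed question for the goal q(0) is p -> q. Both outputs have size linear in the input.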

Proof: $\Rightarrow$: We show that $p(\mathrm{Pf}(0)) \in T_P^{\omega}(\emptyset)$ implies $\Sigma_P \models \varphi_{P,G}$.

If $p(\mathrm{Pf}(0)) \in T_P^{\omega}(\emptyset)$ then there must be $m > 0$ such that $p(\mathrm{Pf}(0)) \in T_P^{m}(\emptyset)$.
Then, by induction on $m$, we have:

$m = 1$: immediate, as $p(\mathrm{Pf}(0))$ must be one of the
facts in $P$, e.g., $p_i(\mathrm{Pf}_i(0))$, and therefore also $\mathrm{Pf}.p = \mathrm{Pf}_i.p_i$. Consequently,
$\Sigma_P \models \varphi_{P,G}$ as $\varphi_{P,G}$ is a trivial path-functional dependency.

$m > 1$: if $p(\mathrm{Pf}(0)) \in T_P^{m}(\emptyset)$ then there is a term $t(0)$ (to be substituted for
$X$) and a clause $p(\mathrm{Pf}'(X)) \leftarrow p_1(\mathrm{Pf}'_1(X)), \ldots, p_l(\mathrm{Pf}'_l(X))$ in $P$ such that
$\mathrm{Pf}(0) = \mathrm{Pf}'(t(0))$ and $p_i(\mathrm{Pf}'_i(t(0))) \in T_P^{m-1}(\emptyset)$. By the IH we have
$\Sigma_P \models \varphi_{P,\,p_i(\mathrm{Pf}'_i(t(0)))}$ and therefore, by composition with the path-functional
dependency $\mathrm{Pf}'_1.p_1, \ldots, \mathrm{Pf}'_l.p_l \to \mathrm{Pf}'.p \in \Sigma_P$, we have $\Sigma_P \models \varphi_{P,G}$.

$\Leftarrow$: Assume $p(\mathrm{Pf}(0)) \notin T_P^{\omega}(\emptyset)$. We construct a counterexample interpretation
as follows: let $o_1, o_2 \in \Delta$ be two distinct objects and $T_1$, $T_2$ two complete infinite
trees rooted by these two objects with edges labeled by primitive attributes, in
this case $f_i$ and $p_i$. Moreover, if $p'(\mathrm{Pf}'(0)) \in T_P^{\omega}(\emptyset)$ we merge the two subtrees
identified by the path $\mathrm{Pf}'.p'$ starting from the respective roots of the trees. The
resulting graph provides an interpretation for DLFD (the nodes of the trees
represent elements of $\Delta$, the edges give interpretation to primitive attributes)
such that:

(i) $(\mathrm{Pf}_i.p_i)^I(o_1) = (\mathrm{Pf}_i.p_i)^I(o_2)$ for all $p_i(\mathrm{Pf}_i(0)) \in P$, and

(ii) every constraint in $\Sigma_P$ is satisfied by the constructed interpretation: Assume
the interpretation violated a constraint $\mathrm{Pf}'_1.p'_1, \ldots, \mathrm{Pf}'_l.p'_l \to \mathrm{Pf}'.p' \in \Sigma_P$.
Then there must be two distinct elements $x_1, x_2$ such that $(\mathrm{Pf}'_i.p'_i)^I(x_1) =
(\mathrm{Pf}'_i.p'_i)^I(x_2)$. From the construction of the interpretation and the fact that the
sets of predicate and function symbols are disjoint we know that $p'_i(\mathrm{Pf}'_i(t(0))) \in
T_P^{\omega}(\emptyset)$ where $t$ is a term corresponding to the paths from $o_1$ and $o_2$ to $x_1$ and $x_2$,
respectively (note that all the paths that end in a particular common node in the
constructed interpretation are symmetric). However, then $p'(\mathrm{Pf}'(t(0))) \in T_P^{\omega}(\emptyset)$
using the clause in $P$ associated with the violated constraint in $\Sigma_P$, and thus
$(\mathrm{Pf}'.p')^I(x_1) = (\mathrm{Pf}'.p')^I(x_2)$, a contradiction.

On the other hand, $(\mathrm{Pf}.p)^I(o_1) \neq (\mathrm{Pf}.p)^I(o_2)$ as $p(\mathrm{Pf}(0)) \notin T_P^{\omega}(\emptyset)$, a contradiction. □

For the constructed DLFD problem we have $|\Sigma_P| + |\varphi_{P,G}| \in O(|P| + |G|)$. Thus:



Corollary 5 The logical implication problem for DLFD is DEXPTIME-hard.

Since DLFD problems can be embedded into DLClass, we have:

Corollary 6 The logical implication problem for DLClass is DEXPTIME-hard.

4 Upper Bound and Decision Procedure for DLClass 

To complete the picture we exhibit a DEXPTIME decision procedure for DLClass
by reducing an arbitrary logical implication problem to the recognition
problem for Datalog$_{nS}$ [12]. We start with two lemmas that are used to simplify
complex DLClass constraints.

Lemma 7 Let $C_1$ be a primitive concept not in $\Sigma \cup \{C' < D'\}$. Then

$\Sigma \cup \{C < (\text{and } D_1\ D_2)\} \models C' < D' \iff \Sigma \cup \{C < D_1, C < D_2\} \models C' < D'$,
$\Sigma \cup \{C < (\text{all } f\ D)\} \models C' < D' \iff \Sigma \cup \{C < (\text{all } f\ C_1), C_1 < D\} \models C' < D'$.
We say that a terminology $\Sigma$ is simple if it does not contain descriptions of the
form (and $D_1$ $D_2$) and whenever (all f D) appears in $\Sigma$ then $D$ is a primitive
concept description. Lemma 7 shows that every terminology can be converted
to an equivalent simple terminology.

Lemma 8 Let $\Sigma$ be a simple terminology and $C_1$ a primitive concept not present
in $\Sigma \cup \{C < D, C' < D'\}$. Then

$\Sigma \models C < (\text{and } D_1\ D_2) \iff \Sigma \models C < D_1$ and $\Sigma \models C < D_2$,

$\Sigma \models C < (\text{all } f\ D) \iff \Sigma \cup \{C < (\text{all } f\ C_1)\} \cup \{C_1 < C_2 : \Sigma \models C < (\text{all } f\ C_2)\} \models C_1 < D$.

We say that a subsumption constraint C < D is simple if it has one of the forms
C < C', C < (all f C'), or C < (fd C' : Pf_1, ..., Pf_k → Pf). Lemmas 7 and
8 allow us to convert general logical implication problems to (sets of) problems
where all subsumption constraints are simple. For each such problem $\Sigma \models \varphi$ we
define a Datalog$_{nS}$ recognition problem $P_\Sigma \cup P_\varphi \models G_\varphi$ as follows:



$P_\Sigma$ = {
  $cl(X, c_j, Y) \leftarrow cl(X, c_i, Y)$   for all $C_i < C_j \in \Sigma$   (1)
  $cl(f(X), c_j, Y) \leftarrow cl(X, c_i, Y)$   for all $C_i < (\text{all } f\ C_j) \in \Sigma$   (2)
  $cl(X, Y, 1) \leftarrow eq(X), cl(X, Y, 2)$   (3)
  $cl(X, Y, 2) \leftarrow eq(X), cl(X, Y, 1)$   (4)
  $eq(f(X)) \leftarrow eq(X)$   for all primitive attributes $f$   (5)
  $eq(\mathrm{Pf}(X)) \leftarrow cl(X, c_i, 1), cl(X, c_j, 2), eq(\mathrm{Pf}_1(X)), \ldots, eq(\mathrm{Pf}_k(X))$   (6)
  $eq(\mathrm{Pf}(X)) \leftarrow cl(X, c_i, 2), cl(X, c_j, 1), eq(\mathrm{Pf}_1(X)), \ldots, eq(\mathrm{Pf}_k(X))$   (7)
      for all $C_i < (\text{fd } C_j : \mathrm{Pf}_1, \ldots, \mathrm{Pf}_k \to \mathrm{Pf}) \in \Sigma$ }

The clauses stand for the inferences of inheritance (1), direct typing (2), typing
inferred from equalities (3-4), propagation of equality by primitive attributes
(5), and path FD inference (6-7), respectively. In addition we use a set of facts
to represent the left-hand-side of the posed question²:

$P_\varphi$ =
  $\{cl(0, c_i, 1)\}$   for $\varphi = C_i < C_j$
  $\{cl(0, c_i, 1)\}$   for $\varphi = C_i < (\text{all } f\ C_j)$
  $\{cl(0, c_i, 1), cl(0, c_j, 2), eq(\mathrm{Pf}_1(0)), \ldots, eq(\mathrm{Pf}_k(0))\}$
      for $\varphi = C_i < (\text{fd } C_j : \mathrm{Pf}_1, \ldots, \mathrm{Pf}_k \to \mathrm{Pf})$

and a ground atom to represent the right-hand-side of the posed question:

$G_\varphi$ =
  $cl(0, c_j, 1)$   for $\varphi = C_i < C_j$
  $cl(f(0), c_j, 1)$   for $\varphi = C_i < (\text{all } f\ C_j)$
  $eq(\mathrm{Pf}(0))$   for $\varphi = C_i < (\text{fd } C_j : \mathrm{Pf}_1, \ldots, \mathrm{Pf}_k \to \mathrm{Pf})$

Intuitively, the ground facts $cl(\mathrm{Pf}(0), c_j, i)$ and $eq(\mathrm{Pf}(0))$ derived from using
$P_\Sigma$ stand for properties of two distinguished nodes $o_1$ and $o_2$ and their descendants,
in particular for $\mathrm{Pf}^I(o_i) \in C_j^I$ and $\mathrm{Pf}^I(o_1) = \mathrm{Pf}^I(o_2)$, respectively. In
addition $|P_\Sigma| + |P_\varphi| \in O(|\Sigma| + |\varphi|)$ and $|G_\varphi| \in O(|\varphi|)$.

² Essentially an application of the Deduction theorem.
Theorem 9 Let $\Sigma$ be an arbitrary simple DLClass terminology and $\varphi$ a simple
DLClass subsumption constraint. Then $\Sigma \models \varphi \iff P_\Sigma \cup P_\varphi \models G_\varphi$.

Proof: (sketch)

$\Leftarrow$: By induction on stages of $T_{P_\Sigma \cup P_\varphi}$, showing that every clause in $P_\Sigma \cup P_\varphi$
represents a valid inference (essentially the same as the "only-if" part of the
proof of Theorem 4).

$\Rightarrow$: By contradiction we assume that $G_\varphi \notin T^{\omega}_{P_\Sigma \cup P_\varphi}(\emptyset)$. We again construct an
interpretation for DLClass starting with two complete infinite trees with edges
labeled by primitive attribute names. We merge the two nodes accessible from
the two distinct roots by the path Pf whenever $eq(\mathrm{Pf}(0)) \in T^{\omega}_{P_\Sigma \cup P_\varphi}(\emptyset)$ (all children
of such nodes are merged as well due to the clause $eq(f(X)) \leftarrow eq(X) \in P_\Sigma$). In
addition we label each node $n$ in the resulting graph by a set of class (identifier)
labels $c_i$ if $n = o_j.\mathrm{Pf}$ and $cl(\mathrm{Pf}(0), c_i, j) \in T^{\omega}_{P_\Sigma \cup P_\varphi}(\emptyset)$ ($j = 1, 2$).

Nodes of the resulting graph then provide the domain $\Delta$ of the interpretation;
primitive concept $C_i$ is interpreted as the set of nodes labeled $c_i$ and the interpretation
of primitive attributes is given by the edges of the graph.

The resulting interpretation satisfies $\Sigma$ (by case analysis for the individual constraints
in $\Sigma$ using the corresponding clauses in $P_\Sigma$) and the "left-hand" side of
$\varphi$ (follows from the definition of $P_\varphi$), but falsifies the "right-hand" side of $\varphi$. □

This result completes the circle of reductions

(normal) Datalog$_{nS}$ → DLFD → DLClass → Datalog$_{nS}$



Corollary 10 Logical implication problems in DLFD and DLClass are DEXPTIME-complete.

In particular, this also means that every DLClass problem can be reformulated 
as a DLFD problem, and consequently that typing and inheritance constraints 
do not truly enhance the expressive power of DLClass. 



5 Polynomial Cases 

Previous sections have established DEXPTIME-completeness for DLClass. However,
there is an interesting syntactic restriction on uniqueness constraints in
DLClass that (a) allows for a low-order polynomial time decision procedure that
solves the logical implication problem, and (b) has a number of practical applications
in the database area [16,17]. In particular, the restriction requires all fd
descriptions in a terminology $\Sigma$ to be regular; in other words, to have one of the
forms



1. (fd C : Pf_1, ..., Pf. Pf', ..., Pf_k → Pf) or

2. (fd C : Pf_1, ..., Pf. Pf', ..., Pf_k → Pf. f).
Given the connection between DLFD and Datalog$_{nS}$ established by Theorem 4, a
natural question is whether there is a syntactic restriction of Datalog$_{nS}$ programs
that leads to an efficient decision procedure. We identify such a restriction in
the following definition.



Definition 11 (Regular Datalog$_{nS}$) Let $P$ be a normal Datalog$_{nS}$ program.
We say that $P$ is regular if every clause with a non-empty body has the form

$$p(t(X)) \leftarrow p_1(t_1(X)), \ldots, q(t'(t(X))), \ldots, p_k(t_k(X))$$

for some terms $t, t', t_1, \ldots, t_k$ (note that any of the terms may be just the variable
$X$ itself).



Theorem 12 The recognition problem $P \models G$ for regular Datalog$_{nS}$ programs
has a low-order polynomial time decision procedure.

Proof: Consider the conversion of $P$ to $\Sigma_P$ presented in Theorem 4. It is not
hard to see that such a conversion of any clause with non-empty body in $P$ to
a constraint in DLFD would obtain a regular fd description. The statement of
the theorem then follows since the conversion of the recognition problem takes
$O(|P| + |G|)$ time, $|\Sigma_P| + |\varphi_{P,G}| \in O(|P| + |G|)$, and the obtained (equivalent)
logical implication problem can be solved using an $O(|\Sigma_P| \cdot |\varphi_{P,G}|)$ procedure for
regular DLClass problems [16]. □

In addition, a slight generalization of the regularity condition in DLClass leads
to intractability [16]. The same turns out to be true for regular Datalog$_{nS}$.



Definition 13 (Nearly-regular Datalog$_{nS}$) We define $P$ as a nearly regular
Datalog$_{nS}$ program if every clause with non-empty body has one of the forms

1. $p(f(t(X))) \leftarrow p_1(t_1(X)), \ldots, q(t'(t(X))), \ldots, p_k(t_k(X))$ or

2. $p(t(f(X))) \leftarrow p_1(t_1(X)), \ldots, q(t'(t(X))), \ldots, p_k(t_k(X))$.

Essentially, near-regularity allows an additional function symbol $f$ to appear in
the head of a clause.

Theorem 14 For an arbitrary normal Datalog$_{nS}$ program $P$ there is an equivalent
nearly regular Datalog$_{nS}$ program $P'$.

Proof: Consider $P'$ that contains the same facts as $P$, and for every clause

$$p(f_1(\ldots f_k(X) \ldots)) \leftarrow p_1(t_1(X)), \ldots, p_l(t_l(X)) \in P$$

with non-empty body, contains a set of clauses

$q_1(f_k(X)) \leftarrow p_1(t_1(X)), \ldots, p_l(t_l(X))$,
$q_2(f_{k-1}(f_k(X))) \leftarrow q_1(f_k(X))$,
$\vdots$
$q_k(f_1(\ldots f_k(X) \ldots)) \leftarrow q_{k-1}(f_2(\ldots f_k(X) \ldots))$,
$p(f_1(\ldots f_k(X) \ldots)) \leftarrow q_k(f_1(\ldots f_k(X) \ldots))$

for some predicate symbols $q_1, \ldots, q_k$ not occurring in $P$. Each of the generated
clauses is nearly regular (by satisfying the first form in Definition 13). Similarly,
we could have generated $P'$ where all clauses have satisfied the second form in
Definition 13.

In both cases, the statement of the theorem is easy to establish from the construction
of $P'$. □
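The clause-splitting step used in this proof can be sketched in a few lines. The encoding is an assumption: an atom is a pair (pred, path) whose path lists the function symbols innermost first, so the head p(f_1(... f_k(X) ...)) has path (f_k, ..., f_1). The generated predicate names are hypothetical and are simply assumed not to occur in P.

```python
from itertools import count

def split_clause(head, body, fresh):
    """Replace one clause by a chain of clauses that each add one function symbol."""
    pred, path = head
    k = len(path)
    if k == 0:
        # No function symbol in the head: the clause is kept unchanged.
        return [(head, list(body))]
    qs = [fresh() for _ in range(k)]
    # q_1(f_k(X)) <- original body
    out = [((qs[0], path[:1]), list(body))]
    # q_{i+1}(f_{k-i}(... f_k(X) ...)) <- q_i(f_{k-i+1}(... f_k(X) ...))
    for i in range(1, k):
        out.append(((qs[i], path[: i + 1]), [(qs[i - 1], path[:i])]))
    # p(f_1(... f_k(X) ...)) <- q_k(f_1(... f_k(X) ...))
    out.append(((pred, path), [(qs[k - 1], path)]))
    return out

def make_fresh(prefix="_q"):
    """Generator of predicate names assumed not to occur in the original program."""
    counter = count(1)
    return lambda: f"{prefix}{next(counter)}"

# Hypothetical clause p(f(g(X))) <- r(X): the head path is (g, f), innermost first.
SPLIT = split_clause(("p", ("g", "f")), [("r", ())], make_fresh())
```

A head with k nested function symbols yields k + 1 clauses, so the size of the program grows only polynomially.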



Corollary 15 The recognition problem for nearly regular Datalog$_{nS}$ programs
is DEXPTIME-complete.



6 Summary and Discussion 

We have presented a description logic dialect called DLClass that is capable
of capturing many important object relational database constraints including
inheritance, typing, primary keys, foreign keys, and object identity. Although
DLClass and its applications in information systems have been explored in earlier
work [4,16,17], this paper establishes a strong connection with Datalog$_{nS}$. We
have explored this connection and obtained the following results:

- An open problem relating to the complexity of DLFD (and therefore to the
complexity of DLClass) that was originally considered in [15,20] has been
resolved. In particular, the result implies that the regularity condition for
DLClass [16] is a boundary between tractable and intractable problems in
DLClass.

- Our relatively straightforward linear reduction of DLClass
to DLFD implies that inheritance and typing constraints do not appreciably
add to the utility of DLFD for capturing object relational schema. Such
constraints are straightforwardly expressed in DLFD.

- We have identified a subset of the Datalog$_{nS}$ recognition problems that can
be solved in PTIME. Moreover, we have shown that the regularity condition
for Datalog$_{nS}$ programs is a boundary between polynomial and exponential
time problems in Datalog$_{nS}$.

In general, a DL dialect is obtained by a selection of concept constructors that 
will in turn determine the tradeoff between the utility of the dialect for capturing 
schema structure and the complexity of the associated implication problem. For 
example, CLASSIC [3] opts for a less expressive selection that excludes negation 
and disjunction constructors in order to ensure an efficient polynomial time
algorithm for its implication problem. Similarly, DLClass may be viewed as a 
minimal core of CLASSIC extended with uniqueness constraints. Conversely, 
other dialects opt for a more expressive selection that includes such constructors 
with the expectation that model building implication checking procedures work 
well on typical real world schema structure [10]. 

Note that DLClass excludes any consideration of so-called roles, which are
essentially set-valued attributes. Allowing roles to be used in formulating
uniqueness constraints is an interesting avenue for future work, although allowing
them to be used only at the end of path expressions is a straightforward
generalization of our results that would accommodate the notion of key in [9]. Also note
that DLClass interprets attributes as total functions. However, partial functions 
can be straightforwardly simulated by employing a simple notational convention. 

It is worth mentioning other work on various forms of path constraints for
graph based data models relating to semi-structured data and object relational
schema [5,6,13] that has similar objectives to research in the area of DLs.
Indeed, many of the results in this work appear to have close analogies to corre- 
sponding work on DLs, although a more thorough exploration of this relationship 
is an interesting avenue for further research. 

Another direction for future research relates to a property that DLFD shares 
with many DL dialects: its arbitrary and finite logical implication problems do 
not coincide. In particular, 

$\{\mathrm{THING} < (\mathrm{fd}\ \mathrm{THING} : A.B \to \mathit{Id})\} \models \mathrm{THING} < (\mathrm{fd}\ \mathrm{THING} : B \to \mathit{Id})$

holds if and only if one requires the domain for any interpretation to be finite 
[15]. Furthermore, there is some evidence that existing techniques for finite model 
reasoning are not suited to more general information integration problems that 
involve access to large finite information sources such as the WEB [1]. 
Abstract. A key facility of active database management systems is their
ability to detect and react to the occurrence of events. Such events can 
be either atomic in nature, or specified using an event algebra to form 
complex events. An important role of an event algebra is to define the 
semantics of when events become invalid (event consumption). In this 
paper, we examine a simple event algebra and provide a logical frame- 
work for specification of various consumption policies. We then study the 
problems of equivalence and implication, identifying a powerful class of 
complex events for which equivalence is decidable. We then demonstrate 
how extensions of this class lead to undecidability. 



1 Introduction 

First we briefly introduce the context of our work, list our contributions and 
survey related works. 



1.1 Events in Active Databases 

Active databases provide the functionality of traditional databases and addi- 
tionally are capable of reacting automatically to state changes without user 
intervention. This is achieved by means of Event-Condition-Action (ECA) rules
of the form on event if condition do action. The event part of an ECA rule
specifies the points in time when the rule should become triggered and is the 
focus of this paper. Events are of two kinds: primitive events correspond to ato- 
mic, detectable occurrences, while complex events are combinations of simple 
events specified using an event algebra. Detecting the occurrence of a complex 
event corresponds to the problem of evaluating an event query (specified using 
the event algebra) over a history of all previously occurring primitive events.^1
A notable characteristic of active database prototypes has been the rich and ex- 
pressive event algebras developed [11]. An interesting feature of such algebras is 
the notion of event consumption policies. These are used to specify when certain 
instances of an event should no longer be considered in the detection of other 

* Supported by EPSRC No. GR/L26872. 

^1 In this paper, we will use the terms complex event detection and (complex) event
query evaluation over a history synonymously. 
events depending on it. Previous work has predominantly dealt with the
definition of policies for a range of applications, without focusing on their formal
foundations. Indeed, there is not even widespread agreement on which policies 
should be standard features of an event algebra. In contrast, one of the purposes 
of our work is to give a framework for specifying event consumption policies and 
investigate what consequences their inclusion has for the expressiveness of the 
event query language as a whole. 

In addition to expressiveness, another important question is to consider what 
kinds of analysis are possible for different classes of event queries. For example, 
the histories over which event queries may be evaluated can be very large and 
it is important to avoid redundant evaluation. In fact, efficient event detection
has been identified as a major challenge in improving active DBMS system
performance [6]. One possible optimisation involves being able to reason
about event query equivalence and implication. Given event queries $q_1$ and $q_2$,
if $q_1 \leftrightarrow q_2$, then only one of $\{q_1, q_2\}$ needs to be evaluated over the event history
(whichever is less expensive, according to some cost model). Alternatively, if
$q_1 \to q_2$, then one need not evaluate $q_2$ whenever $q_1$ is true. Being able to decide
these questions is therefore a useful property for an event query language to
possess, and the second half of this paper provides (un)decidability results for a
core event language and a selection of consumption policies.



1.2 Contributions 

We make two main contributions in this paper. The first is a formal procedure 
for the specification of event consumption which is then useful for comparison of 
consumption policies. The second contribution is the presentation of decidabi- 
lity and undecidability results for implication and equivalence of event queries. 
We define a powerful query class for which equivalence is decidable, yet impli- 
cation is undecidable. Our results also highlight the important role that event 
consumption has in determining whether an effective analysis is possible. To our 
knowledge, this is the first paper to consider decision questions for event queries 
making use of different consumption policies. 

1.3 Related Work 

A number of active database prototypes have been built providing sophisticated 
event algebras (e.g. SNOOP, REACH, NAOS, CHIMERA [11]), but there are
no associated results on the expressiveness of these algebras which could be
used as the basis for (global) reasoning about implication and equivalence of
event queries. The event language of the system ODE [5], however, has been
shown to have the power of regular expressions, but event consumption policies
are not an explicit feature of the language. [10] proposes the use of Datalog$_{1S}$ for
expressing complex event queries, the advantage being that it has a well-defined
formal semantics. Such a language is rather more powerful than the algebras we
consider in this paper, however, and indeed equivalence of Datalog$_{1S}$ expressions
can easily be shown to be undecidable. Similar considerations apply to work
on modelling complex events using the Kowalski-Sergot event calculus [3] or
coloured Petri nets [4] as used in SAMOS. 

An early work which recognised the importance of event consumption is [1],
where a variety of so-called parameter contexts were proposed for matching and 
consuming events. Work in [15] provides a meta-model for classifying a number 
of properties of complex event formalisms. 

Equivalence and implication are of course important problems elsewhere in 
database theory, e.g., for conjunctive queries. It is not obvious how to use these 
results for reasoning about event queries, since the queries considered usually lack 
one of the following features: an ordering on the underlying domain, negation, 
or the ability to mimic event consumption. Work on temporal logics, however, 
is directly relevant for event reasoning, and we discuss and make use of these 
results in the following section. 

Other work on temporal aspects of active databases includes [2] and [13], 
where the focus is on evaluation and expressiveness of temporal conditions. A 
characteristic that distinguishes conditions from events, however, is that they do 
not have associated consumption policies. Event languages are also used in other 
database areas, such as multimedia, where they are used in specifying interaction 
with documents [9]. 



2 Semantics for Complex Events 

We assume the following about events in the spirit of [15]: a) The time domain
is the set of natural numbers. b) Primitive events are detected by the system
at points in time; that is, we can treat primitive events as primitive symbols
interpreted in time. c) Event types are independent and they can be freely used
as components to generate complex events, i.e., we can treat them as
propositional variables that we can use for building compound formulas. d) The
occurrence time of the terminator of a complex event is the occurrence time of
the complex event, i.e., an event occurs when the last component of the event
has occurred. e) Events can occur simultaneously.

Note that we model complex events as boolean queries over the history of all 
previously occurring primitive events. An event query is true over a history iff it 
is true at the last point in the history (of course it may well be true for prefixes 
of the history also). 

Remark 1. We give two preliminary observations concerning our approach. 

a) In practice, primitive or complex events could have other information
associated with them (e.g., parameters or variables describing why they occurred).^2
We ignore such a refinement in this paper, instead concentrating on the problem 
of recognising whether an event has occurred, as opposed to explaining why it 
occurred. 

^2 If the domain of such parameters is finite, then we could of course model them by
defining extra primitive events. 
b) We are considering the definition of event queries and not techniques for
their evaluation. In practice, incremental methods may be appropriate, but such 
evaluation methods are an orthogonal issue to the problems of expressiveness 
and analysis that we address. 



2.1 Operations on Events 

Next we review the most often used operations on events. We also propose a 
concise and formal way to express events occurring in time by using temporal 
logic over natural numbers. 

We will model events occurring in active databases in time. Let $\mathfrak{N} = (\mathbb{N}, <, I)$
be a structure such that $\mathbb{N} = \{0, 1, \ldots\}$ is the set of natural numbers, $<$ is
their usual ordering according to magnitude, and $I$ is an interpretation function
associating a subset $I(p)$ of $\mathbb{N}$ to every primitive event $p$ (i.e., when it occurred).

Basic operations. For given events e and e', we first consider the following basic 
operations: 

— sequence operation $e ; e'$: events $e$ and $e'$ should occur in this order;

— simultaneous operation $e \parallel e'$: the events occur simultaneously;

— conjunction operation $e \sqcap e'$: both occur, but in any order;

— disjunction operation $e \sqcup e'$: at least one of them occurs;

— negation operation $\sim e$: $e$ does not occur (at a given time point).

Using the semantics above, it is straightforward to define when an event $e$ occurs
at a time-point $t \in \mathbb{N}$ (notation: $\mathfrak{N}, t \models e$). Note that $e \sqcap e'$ can be defined in
terms of $\sqcup$, $\parallel$ and $;$: $(e \parallel e') \sqcup (e ; e') \sqcup (e' ; e)$.
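
The basic operations can be made concrete with a small evaluator. The following Python sketch is illustrative only: the tuple encoding of events, the names `holds` and `history`, and the sample history are assumptions introduced here, not notation from the paper. The sequence operation is read as "the first event occurred strictly earlier, the second occurs now".

```python
# Hypothetical encoding of event expressions as nested tuples:
#   ("prim", name), ("seq", e1, e2), ("sim", e1, e2),
#   ("conj", e1, e2), ("disj", e1, e2), ("neg", e1)

def holds(e, history, t):
    """Return True iff event e occurs at time point t.

    history maps each primitive event name p to the set I(p) of its
    occurrence times."""
    op = e[0]
    if op == "prim":
        return t in history.get(e[1], set())
    if op == "neg":
        return not holds(e[1], history, t)
    if op == "sim":
        return holds(e[1], history, t) and holds(e[2], history, t)
    if op == "disj":
        return holds(e[1], history, t) or holds(e[2], history, t)
    if op == "seq":
        # e2 occurs now and e1 occurred at some strictly earlier point
        return holds(e[2], history, t) and any(
            holds(e[1], history, s) for s in range(t))
    if op == "conj":
        # defined from the other operations: (a || b) or (a ; b) or (b ; a)
        a, b = e[1], e[2]
        rewritten = ("disj", ("sim", a, b),
                     ("disj", ("seq", a, b), ("seq", b, a)))
        return holds(rewritten, history, t)
    raise ValueError(f"unknown operation: {op}")

# Illustrative history (an assumption): a occurs at 1, b occurs at 3.
history = {"a": {1}, "b": {3}}
a, b = ("prim", "a"), ("prim", "b")
```

With this history, `holds(("seq", a, b), history, 3)` is true, while `holds(("seq", b, a), history, 3)` is false because a does not occur at time 3.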

We will see shortly how to express the above events using temporal logic, and 
we will also extend this translation to more complex event definitions as well. 



2.2 Temporal Logic 

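Next we briefly recall the basics of temporal logic.^3 The language of our
propositional temporal logic contains the usual propositional connectives conjunction
$\wedge$ and negation $\neg$ and the binary temporal connectives until $\mathcal{U}$ and since $\mathcal{S}$. We
also assume that a countable set $P$ of propositional variables is at hand.

Other propositional connectives (disjunction $\vee$, implication $\to$, true $\top$, false
$\bot$) are defined in the usual way. We define the unary temporal connectives future
$\mathcal{F}$ and past $\mathcal{P}$:

$\mathcal{F}\varphi \stackrel{\mathrm{def}}{=} \mathcal{U}(\varphi, \top)$ and $\mathcal{P}\varphi \stackrel{\mathrm{def}}{=} \mathcal{S}(\varphi, \top)$.

As usual, $\mathcal{G}\varphi$ denotes $\neg\mathcal{F}\neg\varphi$ and $\mathcal{H}\varphi$ stands for $\neg\mathcal{P}\neg\varphi$, while $\Diamond\varphi$ is a shorthand
for $\mathcal{P}\varphi \vee \mathcal{F}\varphi \vee \varphi$, and $\Box\varphi \stackrel{\mathrm{def}}{=} \neg\Diamond\neg\varphi$. We will use the following definitions as well:
$\bigcirc\varphi \stackrel{\mathrm{def}}{=} \mathcal{U}(\varphi, \bot)$ for next, $\ominus\varphi \stackrel{\mathrm{def}}{=} \mathcal{S}(\varphi, \bot)$ for previous, and
$\odot\varphi \stackrel{\mathrm{def}}{=} \Diamond(\varphi \wedge \neg\mathcal{P}\top)$ for beginning.

^3 We will use only past operators for modelling events in this paper, but we present a
general framework that will enable us to include, e.g., rules as well.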

In some applications, e.g., for expressing periodic events, it will be useful 
to consider a temporal language with an additional fixed point operation $\mu p.\varphi$.
It is defined in the usual way: roughly speaking, if the free occurrences of the
propositional symbol $p$ are in the scope of past temporal operators in $\varphi$, then we
can form the formula $\mu p.\varphi$. Since at any time point the past is finite (we have
natural numbers flow of time), this guarantees the existence of a fixed point; see
[7] for more details. 

We are going to interpret temporal formulas over flows of time on the natural 
numbers. In more detail, let $\mathbb{N} = \{0, 1, \ldots, n, \ldots\}$ be the set of natural numbers
and $<$ their usual ordering (according to magnitude). Let $v$ be an evaluation of
the propositional variables in $P$, i.e., $v : P \to \wp(\mathbb{N})$. Then truth of formulas in
the model $\mathfrak{N} = (\mathbb{N}, <, v)$ is defined in the usual way; the case of the temporal
connectives is as follows: for every $n \in \mathbb{N}$,

$\mathfrak{N}, n \Vdash \mathcal{U}(\varphi, \psi) \iff$ for some $m > n$, $\mathfrak{N}, m \Vdash \varphi$
    and for all $l$ with $n < l < m$, $\mathfrak{N}, l \Vdash \psi$

$\mathfrak{N}, n \Vdash \mathcal{S}(\varphi, \psi) \iff$ for some $m < n$, $\mathfrak{N}, m \Vdash \varphi$
    and for all $l$ with $m < l < n$, $\mathfrak{N}, l \Vdash \psi$.
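
The truth conditions for until and since can be checked mechanically on a finite stretch of time. The sketch below is an illustration under stated assumptions: formulas are modelled as Python predicates on time points, the helper names are hypothetical, and `until` only searches witnesses up to a caller-supplied `horizon`, since a program cannot inspect an unbounded future.

```python
def since(phi, psi, n):
    """S(phi, psi) at n: phi holds at some m < n and psi holds at every l
    with m < l < n."""
    return any(
        phi(m) and all(psi(l) for l in range(m + 1, n))
        for m in range(n))

def until(phi, psi, n, horizon):
    """U(phi, psi) at n, with witnesses m searched only in n < m <= horizon
    (the bound is an assumption made for finite data)."""
    return any(
        phi(m) and all(psi(l) for l in range(n + 1, m))
        for m in range(n + 1, horizon + 1))

def true(_t):
    return True

def false(_t):
    return False

def past(phi, n):
    """S(phi, true): phi held at some strictly earlier point."""
    return since(phi, true, n)

def previous(phi, n):
    """S(phi, false): phi held at the immediately preceding point."""
    return since(phi, false, n)

def p(t):
    """Illustrative valuation (an assumption): v(p) = {2, 5}."""
    return t in {2, 5}
```

Because the second argument must hold at every point strictly between the witness and the current point, choosing it to be false forces the witness to be adjacent, which is why `previous(p, 3)` holds and `previous(p, 4)` does not.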

Finally, the interpretation of fixed point formulas is as follows. 

$(\mathbb{N}, <, v), n \Vdash \mu p.\varphi$ iff $n \in w(p)$

for that unique evaluation $w$ of the propositional variables that satisfies

— for every atom $q$ distinct from $p$, $w(q) = v(q)$,

— $n \in w(p)$ iff $(\mathbb{N}, <, w), n \Vdash \varphi$.



Theorem 1. The complexity of the decision problem for temporal logic of $\mathcal{U}$ and
$\mathcal{S}$ over natural numbers flow of time is PSPACE [12]. This remains true even if
we include the fixed point operator [7].



2.3 Complex Event Definitions 

First we give the translations of the basic operations on events into temporal 
logic: 

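— $e_{tr} = e$ for primitive event $e$;

— $(e ; e')_{tr} = e'_{tr} \wedge \mathcal{P} e_{tr}$;

— $(e \parallel e')_{tr} = e_{tr} \wedge e'_{tr}$;

— $(e \sqcap e')_{tr} = (e_{tr} \wedge e'_{tr}) \vee (e_{tr} \wedge \mathcal{P} e'_{tr}) \vee (e'_{tr} \wedge \mathcal{P} e_{tr})$;

— $(e \sqcup e')_{tr} = e_{tr} \vee e'_{tr}$;

— $(\sim e)_{tr} = \neg e_{tr}$.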
3 Consumption Policies

Event consumption policies are used to specify when certain instances of an event 
should no longer be considered in the detection of other events that depend on 
it. There are many ways in which event instances might be consumed (or not) 
when composite events are detected. In this section, we recall some representa- 
tive policies for consuming events, similar to those of [1]. We propose a formal 
logic-based definition for consumption policies that will enable us to look at the 
problem of the complexity of deciding if two events together with consumption 
policies are equivalent. 

Several consumption policies have been considered for the sequence operation 
on events. First we give a general definition, then look at particular policies and 
express them in the general framework. In the next section, we will consider the 
complexity of deciding equivalence and implication between event queries using 
consumption policies. 

Definition 1. Let $\mathfrak{N} = (\mathbb{N}, <, I)$ be a model with natural numbers universe $\mathbb{N}$
together with the usual ordering $<$ and an interpretation function $I$ that
associates a subset $I(p)$ of $\mathbb{N}$ to every primitive event $p$. Let $e$ and $e'$ be events and
$R \subseteq \mathbb{N} \times \mathbb{N}$ be a binary relation that may depend on $e$ and $e'$. Then we define
the sequence operation $;_R$ as follows:

$\mathfrak{N}, t \models (e ;_R e') \iff \mathfrak{N}, t \models e'$ and $\mathfrak{N}, t' \models e$ for some $t' < t$ such that $R(t', t)$.

That is, $e ;_R e'$ holds at a time-point $t$ iff $e ; e'$ occurs at $t$ and the witness $t'$ for
$e$ is related to $t$ by $R$.
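
Definition 1 can be read as a filter on the witnesses of the plain sequence operation. The following Python sketch illustrates this under assumptions: the occurrence sets of e and e' are given directly as sets of time points, and the function name `seq_under` and the relation `adjacent` are hypothetical examples, not policies from the paper.

```python
def seq_under(R, occ_e, occ_e2, t):
    """Sequence of e and e' under relation R at time t: e' occurs at t and
    e occurs at some t' < t with R(t', t)."""
    return t in occ_e2 and any(s in occ_e and R(s, t) for s in range(t))

def universal(s, t):
    """No restriction on the witness: recovers the plain sequence e ; e'."""
    return True

def adjacent(s, t):
    """Hypothetical policy: e must occur immediately before e'."""
    return t == s + 1

# Illustrative occurrence sets (assumptions).
occ_e, occ_e2 = {0, 4}, {2, 5}
```

Under `universal`, the event is detected at both 2 and 5; under `adjacent`, only at 5, since no occurrence of e sits at time 1.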

Note that by this definition we have a recursive way of defining more complex 
events under consumption policies. For instance, given two relations $R$ and $S$,
the definition of the event $(p ;_R q) ;_S r$ is as follows. Given a model $\mathfrak{N}$, first
apply the definition of $;_R$ to determine for each $t \in \mathbb{N}$ if $p ;_R q$ occurs at $t$; then
apply the definition of $;_S$ to $e ;_S r$ where $e$ is the event $p ;_R q$. In other words,
we compute a composite event by first computing the participating events and 
then, treating them as atomic events, the composite event. We also note that the 
above definition is in the style of modal logic, where the extra-boolean operations 
are evaluated according to accessibility relations. More generally, our definition 
of event consumption is compositional in the sense that the meaning (or truth) 
of a formula is determined by the meaning of its subformulas (plus the structure 
of the model). 

Particular consumption policies are defined by imposing certain conditions 
on the relation $R$. For instance, the sequence operation ; can be treated as an
operation with the most permissive consumption policy. In this case, we can
take $R$ as the universal binary relation, i.e., there is no further restriction on the
$p$-witness $t'$ for $\mathfrak{N}, t \models p ;_R q$ other than $t' < t$. Furthermore, as we will see below
for the most-recent and cumulative consumption policies, there are cases when
$R$ is essentially a unary predicate. This suggests a definition of a subclass $CP_1$ of
consumption policies when $R$ can be substituted by a first-order formula using
only unary predicates and the ordering $<$ in Definition 1. We already mentioned
that the temporal logic of $\mathcal{U}$ and $\mathcal{S}$ is expressively complete w.r.t. first-order logic
over natural numbers. Then for every operation $;_R$ in $CP_1$, there is an equivalent
temporal logic formula $\varphi_R$ with parameters:

$\mathfrak{N}, t \models p ;_R q \iff \mathfrak{N}, t \Vdash \varphi_R(p, q)$.    (1)

We can further strengthen this result by allowing definitions of $R$ using monadic
second-order logic S1S. Let $CP_{S1S}$ be the resulting class of consumption policies.
Again we have the equivalence result that the temporal logic of $\mathcal{U}$, $\mathcal{S}$ and fixed
point is expressively complete w.r.t. S1S over natural numbers [7]. Thus the
above equivalence (1) holds for $CP_{S1S}$ and fixed point temporal logic.

We will demonstrate the idea of consumption policies on the example of the 
composite event e which is the sequential composition of the (primitive) events 
p and q. We will use the following example, see Figure 1. Let us assume that we 
are at time point 11 and the events p and q happened as follows:

$I(p) = \{0, 1, 2, 5, 9, 10\}$

$I(q) = \{3, 4, 6, 7, 8, 11\}$.
[Figure 1: timelines of the occurrences of p and q, one panel per policy. Most
recent: $e = p ;_{MR} q$. Cumulative: $e = p ;_C q$. FIFO: $e = p ;_F q$. LIFO: $e = p ;_L q$.]

Fig. 1. Consumption policies



Most recent. In this context, only the most recent occurrence of the initiator 
(the first component of the event e) p is used. After detecting e, that occurrence 
of p which initiated e is “consumed”, i.e., it cannot initiate another occurrence of
e. Also the terminator (the final component of the event e) is consumed, i.e., it
cannot be the terminator of another occurrence of e. Such a policy is useful for 
situations where events occur rapidly and successive occurrences only refine the 
previous value (e.g., sensor applications). 

In the example above, e is first detected at 3 when e is initiated by the
occurrence of p at 2 and terminated by the occurrence of q at 3; we will denote
this by (e, 3) = (p, 2; q, 3). Then other occurrences of e are as follows:
(e, 6) = (p, 5; q, 6) and (e, 11) = (p, 10; q, 11).

This consumption policy can be defined by the following relation R in the 
above definition: 

$R(t, t') \iff \mathfrak{N}, t \models q$, $\mathfrak{N}, t' \models p$, $t' < t$,

and for each $t''$, if $t' < t'' < t$ then $\mathfrak{N}, t'' \not\models p$ and $\mathfrak{N}, t'' \not\models q$.

That is, $R$ is the partial function that associates consecutive $p$- and $q$-points.
Note that in the above definition of $R$, there was no binary relation apart from
$<$. Thus there is a temporal formula expressing this consumption policy:

$\mathfrak{N}, t \Vdash q \wedge \mathcal{S}(p, \neg p \wedge \neg q) \iff \mathfrak{N}, t \models p ;_R q$.

We will use the notation $e ;_{MR} e'$ for denoting the event $e ; e'$ using the most-recent
consumption policy.^4
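
The most-recent relation can be evaluated directly on the running example. The sketch below is an illustration, with the function name `most_recent_pairs` and the dictionary representation chosen here as assumptions; it scans backwards from each q-point and stops at the first point carrying p or q.

```python
I_p = {0, 1, 2, 5, 9, 10}
I_q = {3, 4, 6, 7, 8, 11}

def most_recent_pairs(I_p, I_q, horizon):
    """Return {t: t'} for each q-point t <= horizon that has an earlier
    p-point t' with neither p nor q strictly between t' and t."""
    pairs = {}
    for t in range(horizon + 1):
        if t not in I_q:
            continue
        for s in range(t - 1, -1, -1):   # scan backwards from t - 1
            if s in I_p:
                pairs[t] = s             # nearest marked point is a p-point
                break
            if s in I_q:
                break                    # an intervening q blocks the match
    return pairs

detections = most_recent_pairs(I_p, I_q, 11)
```

On the example, `detections` pairs 3 with 2, 6 with 5 and 11 with 10, which agrees with the occurrences (e, 3), (e, 6) and (e, 11) listed in the text.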



Cumulative. In this version, all occurrences of the initiator are accumulated until 
the composite event is detected, and all these occurrences are consumed when the 
composite event is detected. Such a policy is useful where multiple occurrences 
of an event need to be grouped together (e.g. multiple deposits preceding a 
withdrawal). Using the above notation, we have (e, 3) = (p, 0, 1, 2; q, 3),
(e, 6) = (p, 5; q, 6) and (e, 11) = (p, 9, 10; q, 11). The definition of $R$ is as follows:

$R(t, t') \iff \mathfrak{N}, t \models q$, $\mathfrak{N}, t' \models p$, $t' < t$,

and for each $t''$, if $t' < t'' < t$ then $\mathfrak{N}, t'' \not\models q$.

Again, we can express this by a temporal formula $q \wedge \mathcal{S}(p, \neg q)$. We will use the
notation $e ;_C e'$ for denoting the event $e ; e'$ using the cumulative consumption
policy. 

Using the above definition of ‘most recent’ and ‘cumulative’ (or the corre- 
sponding temporal logic formulas), one easily shows the following. 

Claim. The most-recent and cumulative consumption policies are equivalent. 
That is, for any given model $\mathfrak{N}$ and $t \in \mathbb{N}$, we have

$\mathfrak{N}, t \models e ;_{MR} e' \iff \mathfrak{N}, t \models e ;_C e'$.
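
The claim concerns detection times only: the two policies group initiators differently but fire at the same points. The following Python sketch checks this on the running example; it is not a proof, and the function names are assumptions introduced for illustration.

```python
I_p = {0, 1, 2, 5, 9, 10}
I_q = {3, 4, 6, 7, 8, 11}

def cumulative_groups(I_p, I_q, horizon):
    """Return {t: [t', ...]}: for each q-point t, all earlier p-points t'
    with no q strictly between t' and t."""
    groups = {}
    for t in range(horizon + 1):
        if t in I_q:
            initiators = [
                s for s in range(t)
                if s in I_p and not any(u in I_q for u in range(s + 1, t))]
            if initiators:
                groups[t] = initiators
    return groups

def most_recent_times(I_p, I_q, horizon):
    """Detection times under the most-recent relation: some earlier p-point
    has neither p nor q strictly between it and the q-point t."""
    return {
        t for t in range(horizon + 1)
        if t in I_q and any(
            s in I_p
            and not any(u in I_p or u in I_q for u in range(s + 1, t))
            for s in range(t))}

groups = cumulative_groups(I_p, I_q, 11)
```

Here `groups` reproduces the cumulative occurrences from the text, and its set of keys coincides with `most_recent_times(I_p, I_q, 11)`.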

From now on, the subscript $X$ in $;_X$ will stand for a particular consumption policy
instead of the underlying binary relation.
FIFO.^5 The event e is detected whenever q occurs and there is an “unconsumed”
earlier occurrence of p; further, the earliest such occurrence of p is the initiator of
the event. An example use of this policy could be if p represented the availability
of a service instance and q represented a request for such a service. We have
(e, 3) = (p, 0; q, 3), (e, 4) = (p, 1; q, 4), (e, 6) = (p, 2; q, 6), (e, 7) = (p, 5; q, 7) and
(e, 11) = (p, 9; q, 11).

The underlying relation $R$ is defined in a recursive way in this case. For
every $n \in \mathbb{N}$, let $\mathfrak{N}_n$ be the initial segment of $\mathfrak{N}$ defined by $n$: $\mathfrak{N}_n$ is the model
$\mathfrak{N}$ relativized to $\mathbb{N}_n = \{t \in \mathbb{N} : t \le n\}$, i.e., we restrict the universe and the
interpretation of events to $\{t \in \mathbb{N} : t \le n\}$. We define $R_0$ to be the empty relation.
Now let us assume that $R_n \subseteq \mathbb{N}_n \times \mathbb{N}_n$ has been already defined. $R_{n+1}$ is defined
as follows. If $\mathfrak{N}, n+1 \not\models q$, then $R_{n+1} = R_n$. If $\mathfrak{N}, n+1 \models q$, then we define
$R_{n+1} = R_n \cup \{(m, n+1)\}$, where $m \le n$ satisfies

$\mathfrak{N}, m \models p$, for each $t \le n$, $(m, t) \notin R_n$,

and for each $t < m$, if $\mathfrak{N}, t \models p$ then $R_n(t, t')$ for some $t' \le n$,

provided such $m$ exists; otherwise we define $R_{n+1} = R_n$. We let $R$ be the union of
the chain $(R_i : i \in \mathbb{N})$. Thus $R$ is a partial injective function that associates every
$q$-point $x$ to the earliest $p$-point $y$ such that $y < x$ and there is no $z < x$ with
$R(y, z)$. Then $;_R$ defines the sequence operation under the FIFO consumption
policy. We will use the notation $e ;_F e'$ for denoting the event $e ; e'$ using the
FIFO consumption policy.
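
Operationally, the recursive definition behaves like a queue of unconsumed initiators. The sketch below simulates this on the running example; the function name `fifo_pairs` and the queue-based formulation are assumptions made for illustration, not the paper's construction.

```python
from collections import deque

I_p = {0, 1, 2, 5, 9, 10}
I_q = {3, 4, 6, 7, 8, 11}

def fifo_pairs(I_p, I_q, horizon):
    """Pair each q-point with the earliest unconsumed, strictly earlier
    p-point; a q-point with no unconsumed p-point detects nothing."""
    pending = deque()                      # unconsumed p-points, oldest first
    pairs = {}
    for t in range(horizon + 1):
        if t in I_q and pending:
            pairs[t] = pending.popleft()   # consume the earliest initiator
        if t in I_p:
            pending.append(t)              # p at t can only initiate later events
    return pairs

fifo = fifo_pairs(I_p, I_q, 11)
```

The q-point 8 produces no detection because all earlier p-points have already been consumed, and the result matches the occurrences listed in the text.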

LIFO.^6 This is a similar policy to the previous one, but the event e is initiated
by the last “unconsumed” previous occurrence of p. In our example, we have the
following occurrences: (e, 3) = (p, 2; q, 3), (e, 4) = (p, 1; q, 4), (e, 6) = (p, 5; q, 6),
(e, 7) = (p, 0; q, 7) and (e, 11) = (p, 10; q, 11).

Again the corresponding relation $R$ is defined recursively. We define $R_0$ to be
the empty relation. Now let us assume that $R_n \subseteq \mathbb{N}_n \times \mathbb{N}_n$ has been already
defined. First assume that $\mathfrak{N}, n+1 \not\models q$ and in this case define $R_{n+1} = R_n$. If
$\mathfrak{N}, n+1 \models q$, then $R_{n+1} = R_n \cup \{(m, n+1)\}$ where $m \le n$ satisfies

$\mathfrak{N}, m \models p$, for each $t \le n$, $(m, t) \notin R_n$,

for each $t$, if $m < t < n+1$ and $\mathfrak{N}, t \models p$ then $R_n(t, t')$ for some $t'$,

provided such $m$ exists; otherwise $R_{n+1} = R_n$. We let $R$ be the union of the
chain $(R_i : i \in \mathbb{N})$. Then $;_R$ defines the sequence operation under the LIFO
consumption policy. We will use the notation $e ;_L e'$ for denoting the event $e ; e'$
using the LIFO consumption policy. Again, $R$ satisfies the following condition:
it is a partial function that maps every $p$-point $x$ to the first $q$-point $y$ such
that $x < y$ and there is no $z < y$ with $R(x, z)$. Then it is not surprising that
satisfiability of FIFO and LIFO events coincides.
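
The LIFO policy differs from FIFO only in which unconsumed initiator is taken, so a stack replaces the queue. As before, this Python sketch is an illustrative simulation on the running example, and the function name `lifo_pairs` is an assumption.

```python
I_p = {0, 1, 2, 5, 9, 10}
I_q = {3, 4, 6, 7, 8, 11}

def lifo_pairs(I_p, I_q, horizon):
    """Pair each q-point with the latest unconsumed, strictly earlier
    p-point; a q-point with no unconsumed p-point detects nothing."""
    pending = []                       # unconsumed p-points, newest last
    pairs = {}
    for t in range(horizon + 1):
        if t in I_q and pending:
            pairs[t] = pending.pop()   # consume the most recent initiator
        if t in I_p:
            pending.append(t)          # p at t can only initiate later events
    return pairs

lifo = lifo_pairs(I_p, I_q, 11)
```

The detection times are the same as under FIFO (3, 4, 6, 7 and 11); only the pairing with initiators changes.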

Claim. The FIFO and LIFO consumption policies are equivalent. 

^5 First in first out.

^6 Last in first out.
Remark 2. Although we have only looked at four examples, there are obviously
(infinitely many) other consumption policies in the classes $CP_1$ and $CP_{S1S}$. The
purpose of defining these two classes is as a classification mechanism and, as
we shall see, $CP_1$ and $CP_{S1S}$ represent classes of policies for which analysis is
“easy”.

Remark 3. The results of the next section imply that $;_F$ and $;_L$ are not definable
in terms of $CP_1$ or $CP_{S1S}$ operations. Intuitively, this is because their definition
requires the relation $R$ to be a “true” binary relation (one that can’t be simulated
by operations on unary relations).

4 Decidability of Events 

In this section we look at the problem of deciding equivalence between events 
using consumption policies. We will consider several event algebras determined 
by various choices of operations. Let $\mathcal{E}_1$ be the event algebra in which events
are generated by the following operations: sequence ; with any consumption
policy from $CP_1 \cup CP_{S1S}$, $\parallel$, $\sqcap$, $\sqcup$, $\sim$. In the definition of $\mathcal{E}_2$, we allow any
basic operation $\parallel$, $\sqcap$, $\sqcup$, $\sim$, plus the sequence operation under the most-recent,
cumulative, FIFO and LIFO policies $;_I$ where $I \in \{MR, C, F, L\}$.^7 So, $\mathcal{E}_1$ is an
event algebra that includes the basic operations plus the ability to define “well
behaved” consumption policies. $\mathcal{E}_2$ is an event algebra including the same basic
operations, but allowing the use of the more expressive policies $;_L$ and $;_F$.

Our aim is to find out the complexity of implication and equivalence. The 
implication problem is defined by: 

given two events $e$ and $e'$, whether for every $\mathfrak{N}$ and $t \in \mathbb{N}$, $\mathfrak{N}, t \models e$ implies
$\mathfrak{N}, t \models e'$.

We will write $e \to e'$ if $e'$ is implied by $e$. Equivalence of events, in symbols
$e \leftrightarrow e'$, is defined by mutual implication.

Theorem 2. Let $e$ and $e'$ be two event queries written in the event algebra $\mathcal{E}_1$.
The complexity of deciding implication and equivalence between e and e' is in 
PSPACE. 

Next we look at the problem of equivalence between queries that can use either
the $;_F$ or the $;_L$ operator (but not both). An additional restriction is that the chosen
operator from $\{;_L, ;_F\}$ can be used no more than once. Such a restriction would
seem realistic in practice, given the likely difficulty in understanding/writing
event queries using $\{;_L, ;_F\}$ multiple times.

Theorem 3. Let $q_1$ and $q_2$ be event queries of the event algebra $\mathcal{E}_2$ such that
each uses one instance of either $;_F$ or $;_L$. Then the problem of whether $q_1 \leftrightarrow q_2$
is decidable.

^7 As future work, we plan to extend the definition of $\mathcal{E}_2$ to also allow any policy in
$CP_1 \cup CP_{S1S}$.
Proof. We show how to translate each $q_i$ ($i \in \{1, 2\}$) into a deterministic one
counter machine. Since equivalence of such machines is decidable (in exponential
time of the size of the machines [14]), it follows that equivalence of $q_1$ and
$q_2$ is decidable. The difficulty of the translation arises from ensuring that the
constructed machine is deterministic.

A deterministic one counter machine (docm) is a deterministic pushdown 
automaton having a stack alphabet of only one symbol. In more detail, 



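$M = (\Sigma, Acc, St, \Lambda, \delta)$

where $\Sigma$ is a finite set of states, $Acc \subseteq \Sigma$ is the set of accepting states, $St \in \Sigma$ is
the starting state, $\Lambda$ is an alphabet and $\delta : \Lambda \times \Sigma \times \mathbb{N} \to \Sigma \times \mathbb{N}$ is a deterministic
transition function such that if $\delta(a, S, n) = (S', m)$, then $m \in \{n, n+1, n-1\} \cap \mathbb{N}$.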
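
To make the definition concrete, a docm can be sketched as a small Python class. Everything below is an illustration under assumptions: the class name `DOCM`, the method `run`, and the example machine (which accepts while it has read more letters "a" than it has been able to cancel with "b") are hypothetical and do not come from the paper.

```python
class DOCM:
    """Deterministic one counter machine: a start state, a set of accepting
    states and a transition function delta(letter, state, counter)."""

    def __init__(self, start, accepting, delta):
        self.start = start
        self.accepting = set(accepting)
        self.delta = delta

    def run(self, word):
        """Read the word letter by letter and report, after each letter,
        whether the machine is in an accepting state."""
        state, counter = self.start, 0
        verdicts = []
        for letter in word:
            state, new_counter = self.delta(letter, state, counter)
            # the counter stays a natural number and moves by at most one
            if new_counter < 0 or abs(new_counter - counter) > 1:
                raise ValueError("invalid counter update")
            counter = new_counter
            verdicts.append(state in self.accepting)
        return verdicts

def delta(letter, state, n):
    """Example transition function (an assumption): 'a' increments the
    counter, any other letter decrements it when it is positive."""
    if letter == "a":
        return "pos", n + 1
    m = max(n - 1, 0)
    return ("pos" if m > 0 else "zero"), m

machine = DOCM(start="zero", accepting={"pos"}, delta=delta)
verdicts = machine.run("aabbba")
```

Reporting a verdict after every letter mirrors the intended use in the proof, where the machine should be in an accepting state exactly at the points of the history where the query is true.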

Recall that an event $q$ is satisfiable if there is an initial segment $\mathfrak{N}_n$ of $\mathfrak{N}$
such that $\mathfrak{N}_n, n \models q$. The event history of $q$, $H(q)$, is the set of models for $q$ in
the above sense. For any $q$ using at most one instance of the LIFO/FIFO operator,
we show how to construct a docm $M_q$ such that $L(M_q) = H(q)$ (where $L(M_q)$ is
the set of input histories accepted by the machine and $H(q)$ is the set of event
histories for which $q$ is true).

We show how to build docm for recognizing events in our algebra. We wish 
the docm to enter an accepting state every time the query q is true in the history. 

Let us fix a language L of primitive events consisting of those events which 
occur in $q_1$ or $q_2$. The alphabet will consist of subsets of $L$ (this is needed to
model histories when events can occur simultaneously; otherwise single events
as letters would do). We will construct the machines equivalent to $q_1$ and $q_2$ by
recursion. 

$e$ where $e$ is a primitive event. We define $M = (\Sigma, Acc, St, L, \delta)$ as follows. The
machine has 3 states, a start state $S_1 = St$, an intermediate state $S_2$ and
an accepting state $S_3 \in Acc$. The transition function $\delta$ is defined regardless
of the counter which will remain empty during the transitions (and will be
omitted below):

$\delta(E, S_1) = S_3$, $\delta(F, S_1) = S_2$, $\delta(F, S_2) = S_2$,

$\delta(E, S_2) = S_3$, $\delta(E, S_3) = S_3$, $\delta(F, S_3) = S_2$

where $E$ and $F$ are any subsets of $L$ such that $e \in E$ and $e \notin F$.

$e_1 \parallel e_2$. Recursively build a docm $M_1 = (\Sigma_1, Acc_1, St_1, L, \delta_1)$ for recognizing $e_1$
and a docm $M_2 = (\Sigma_2, Acc_2, St_2, L, \delta_2)$ for recognizing $e_2$. We now build a
docm $M = (\Sigma, Acc, St, L, \delta)$ which is the “product” of $M_1$ and $M_2$. Each
state in $M$ corresponds to a pair of states from $M_1$ and $M_2$: $\Sigma = \Sigma_1 \times \Sigma_2$.
The start state of $M$ corresponds to the start states of $M_1$ and $M_2$:
$St = (St_1, St_2)$. The pairing of states and transitions then obeys the following
condition: $\delta(F, (S_i, T_j)) = (S_m, T_n)$ where $\delta_1(F, S_i) = S_m$ and $\delta_2(F, T_j) = T_n$.
$(S_i, T_j)$ is accepting for $M$ iff $S_i$ is accepting for $M_1$ and $T_j$ is accepting
for $M_2$: $Acc = Acc_1 \times Acc_2$.
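
The primitive-event machine and the product construction can be sketched together. The Python code below is an illustration only: it assumes neither component uses the counter (so the counter is omitted), represents machines as dictionaries, and uses hypothetical names (`primitive_machine`, `product`, `run`). The `both=False` variant changes only the acceptance condition, which gives the machine for disjunction.

```python
def primitive_machine(e):
    """Three-state machine for primitive event e: after reading a letter
    (a set of primitive events) it is in S3 iff that letter contains e."""
    def delta(letter, _state):
        return "S3" if e in letter else "S2"
    return {"start": "S1", "accepts": lambda s: s == "S3", "delta": delta}

def product(m1, m2, both=True):
    """Product of two counter-free machines. Accepting iff both components
    accept (simultaneous operation), or iff at least one accepts when
    both=False (disjunction)."""
    def delta(letter, state):
        return (m1["delta"](letter, state[0]), m2["delta"](letter, state[1]))
    def accepts(state):
        a, b = m1["accepts"](state[0]), m2["accepts"](state[1])
        return (a and b) if both else (a or b)
    return {"start": (m1["start"], m2["start"]),
            "accepts": accepts, "delta": delta}

def run(machine, word):
    """Return, for each prefix of the word, whether the machine accepts."""
    state, verdicts = machine["start"], []
    for letter in word:
        state = machine["delta"](letter, state)
        verdicts.append(machine["accepts"](state))
    return verdicts

# Illustrative history (an assumption): one set of primitive events per time point.
word = [{"p"}, {"p", "q"}, {"q"}, set()]
simultaneous = product(primitive_machine("p"), primitive_machine("q"))
either = product(primitive_machine("p"), primitive_machine("q"), both=False)
```

On this history, the simultaneous machine accepts only at the second time point, where both p and q occur, while the disjunction machine accepts at the first three.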
$\sim e_1$. Build a docm for recognizing $e_1$. Change all accepting states into
non-accepting states and vice versa (except for the start state which remains the
same).

$e_1 \sqcup e_2$. Similar to $e_1 \parallel e_2$, but now $(S_i, T_j)$ is accepting for $M$ iff $S_i$ is accepting
for $M_1$ or $T_j$ is accepting for $M_2$.

$e_1 ; e_2$. Build a docm $M_1$ for recognizing $e_1$. Construct a new accept state for
$M_1$ called $S_a$ and let $\delta(E, S_a) = S_a$ (for any set $E$ of primitive events). For
each accepting state $S_i \neq S_a$ in $M_1$, make it non-accepting and alter the
transitions such that $\delta(E, S_i) = S_a$ for any set $E$ of primitive events. Call
the new machine $M_1'$. Construct a docm $M_2$ for recognizing $e_2$. Now apply
the “product” construction (used above for recognizing $e_1 \parallel e_2$) on $M_1'$ and
$M_2$. Intuitively, the state $S_a$ has a “delaying” effect: the machine $M$ will
reach an accepting state if $M_2$ is in an accepting state (i.e., $e_2$ occurs now),
and $M_1$ was in an accepting state before (i.e., $e_1$ occurred in the past).

$e_1 ;_L e_2$. Build docm $M_1 = (\Sigma_1, Acc_1, St_1, L, \delta_1)$ and $M_2 = (\Sigma_2, Acc_2, St_2, L, \delta_2)$
for recognizing $e_1$ and $e_2$, respectively. Let $\Sigma_1^+$ be a set of the same cardinality
as $\Sigma_1$ such that the two sets are disjoint. We denote the bijection from $\Sigma_1$
to $\Sigma_1^+$ by $^+$. This is the only stage of the construction where we need the
counter $c$. The definition of the new machine $M = (\Sigma, Acc, St, L, \delta)$ is a
modification of the “product” construction. We define $\Sigma = (\Sigma_1 \cup \Sigma_1^+) \times \Sigma_2$,
$St = (St_1, St_2)$ and $Acc = \{(S^+, T) : T \in Acc_2\}$. Initially, we let the counter
$c$ be 0. The definition of the transition function $\delta$ is as follows. In all cases
below we assume that $\delta_1(S_i) = S_m$ and $\delta_2(T_j) = T_n$. We have
$e_1 ;_{MR} e_2$ and $e_1 ;_C e_2$. These are similar to $e_1 ; e_2$. Also, the proof would be
virtually identical if $;_F$ had been chosen instead of $;_L$.



Now it is routine to check that the machine $M$ we built for the event $q$ has the required property: an $n$-long string of sets of primitive events is accepted iff the corresponding initial segment under the obvious evaluation of primitive







events satisfies q. Further, the size of M is double exponential in terms of the 
size of q. Thus the above decision procedure works in triple exponential time. □ 
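
As a rough illustration of the “product” and complement constructions used in the proof, the following Python sketch builds machines over sets of primitive events. It deliberately omits the counter that a docm needs for the consumption operators, so it only covers the counter-free cases. The names `Machine`, `primitive`, `product_machine` and `complement` are illustrative assumptions, not notation from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, FrozenSet, Hashable, Iterable

Letter = FrozenSet[str]  # one history point: the set of primitive events occurring


@dataclass
class Machine:
    states: frozenset
    start: Hashable
    accepting: frozenset
    step: Callable[[Hashable, Letter], Hashable]

    def accepts(self, history: Iterable[Letter]) -> bool:
        """True iff the machine is in an accepting state after reading history."""
        state = self.start
        for letter in history:
            state = self.step(state, frozenset(letter))
        return state in self.accepting


def primitive(name: str) -> Machine:
    """Machine for a primitive event: accepts iff `name` occurs at the last point."""
    def step(_state, letter):
        return "yes" if name in letter else "no"
    return Machine(frozenset({"start", "yes", "no"}), "start", frozenset({"yes"}), step)


def product_machine(m1: Machine, m2: Machine, require_both: bool) -> Machine:
    """Product construction: intersection if require_both, otherwise union."""
    states = frozenset(product(m1.states, m2.states))

    def is_accepting(pair):
        in1, in2 = pair[0] in m1.accepting, pair[1] in m2.accepting
        return (in1 and in2) if require_both else (in1 or in2)

    def step(pair, letter):
        # Both component machines read the same letter in lockstep.
        return (m1.step(pair[0], letter), m2.step(pair[1], letter))

    accepting = frozenset(s for s in states if is_accepting(s))
    return Machine(states, (m1.start, m2.start), accepting, step)


def complement(m: Machine) -> Machine:
    """Swap accepting and non-accepting states, leaving the start state as it was."""
    flipped = set(m.states - m.accepting) - {m.start}
    if m.start in m.accepting:
        flipped.add(m.start)
    return Machine(m.states, m.start, frozenset(flipped), m.step)
```

For instance, `product_machine(primitive("a"), primitive("b"), True)` accepts a history exactly when both `a` and `b` occur at its last point.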

Corollary 1. Let $q$ be an event query in the algebra $\mathcal{E}_2$, such that it uses one instance of either $;_F$ or $;_L$. Then satisfiability of $q$ is decidable.

We now examine the problem of implication for event queries of the above type. The problem now becomes undecidable. Intuitively, this is because the individual queries can use their $;_F$ or $;_L$ operation in a co-operative fashion to simulate a machine with two counters.

Theorem 4. Let $q_1$ and $q_2$ be event queries in the algebra $\mathcal{E}_2$ where each uses one instance of either $;_F$ or $;_L$. Then the problem of whether $q_1 \Rightarrow q_2$ is undecidable.

Proof. We present a proof for the LIFO consumption semantics; it is virtually identical for FIFO. Given a Minsky machine [8] (abbreviated MM and defined below), we define a set of primitive and complex events used by an event query $q$, which checks whether the event history is a faithful representation of the computation of the MM. $q$ is satisfiable iff the MM terminates. It also has the property that it can be rewritten in the form $q = q_1 \cap \neg q_2$ and is thus satisfiable iff $\neg(q_1 \Rightarrow q_2)$. Thus, we are able to define two queries $q_1$ and $q_2$ (each using one instance of LIFO), which depend on the MM specification, and $q_1 \not\Rightarrow q_2$ iff the Minsky machine terminates, a problem which is undecidable.

An MM [8] is a sequence of $n$ instructions $S_0 : com_0;\ S_1 : com_1;\ \ldots;\ S_n : com_n$ where each instruction $com_i$ has one of the following forms

$S_i$ : $c_1 = c_1 + 1$; goto $S_j$
$S_i$ : $c_2 = c_2 + 1$; goto $S_j$

$S_i$ : if $c_1 = 0$ goto $S_j$ else $c_1 = c_1 - 1$; goto $S_k$
$S_i$ : if $c_2 = 0$ goto $S_j$ else $c_2 = c_2 - 1$; goto $S_k$

where $c_1$ and $c_2$ are both counters. Execution begins at instruction $S_0$, with $c_1$ and $c_2$ initialised to zero. Thereafter, the machine executes the given instruction, appropriately changes the value of the given counter and jumps to the directed instruction. The machine halts iff it reaches instruction $S_n$. So, the computation of the machine can be understood as a (possibly infinite) sequence $S_{i_1}, S_{i_2}, \ldots$. We now define a number of events to ensure that the event history is a faithful representation of this sequence (if it is finite).
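
To make the machine model concrete, here is a minimal Python sketch of a two-counter Minsky machine interpreter that records the sequence of visited instructions. The tuple encoding of instructions, the function name `run_minsky` and the `max_steps` cut-off are assumptions made for illustration; since halting is undecidable, the cut-off only bounds the simulation and decides nothing.

```python
from typing import Dict, List, Optional, Tuple

# Assumed encoding of instructions:
#   ("inc", c, j)     means  c = c + 1; goto S_j
#   ("dec", c, j, k)  means  if c = 0 goto S_j else c = c - 1; goto S_k
# where c is 0 for the first counter and 1 for the second.
Program = Dict[int, tuple]


def run_minsky(program: Program, halt: int,
               max_steps: int = 10_000) -> Optional[Tuple[List[int], Tuple[int, int]]]:
    """Run from instruction 0 with both counters at zero.

    Returns (trace of instruction numbers, final counters) if the halt
    instruction is reached within max_steps, otherwise None.
    """
    counters = [0, 0]
    pc = 0
    trace = [0]
    steps = 0
    while pc != halt:
        if steps == max_steps:
            return None  # budget exhausted; the machine may or may not halt
        instr = program[pc]
        if instr[0] == "inc":
            _, c, target = instr
            counters[c] += 1
            pc = target
        else:
            _, c, if_zero, otherwise = instr
            if counters[c] == 0:
                pc = if_zero
            else:
                counters[c] -= 1
                pc = otherwise
        steps += 1
        trace.append(pc)
    return trace, (counters[0], counters[1])
```

The returned trace plays the role of the sequence of instructions that the event history is meant to represent.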

— For each machine instruction $S_i$, let $e_{S_i}$ be an event.

— $e_c^1 = e_{push}^1 ;_L e_{pop}^1$, used for maintaining the value of the first counter. Intuitively, $e_{push}^1$ increases it by one and $e_{pop}^1$ decreases it by one.

— $e_c^2 = e_{push}^2 ;_L e_{pop}^2$, used for maintaining the value of the second counter.

— $e_{bad1}, e_{bad2}, e_{BAD1}, e_{BAD2}$: events that occur if the history somehow deviates from the MM computation. Each uses multiple definitions and so they should in fact be interpreted as the disjunction of their individual definitions.

— $e_{any}$: this event is true at any point in the history. It can be written as the disjunction of all defined primitive events.

— $e_{first} = e_{any} \cap \neg(e_{any} ; e_{any})$: the first event in the history.
We also rely on the following fact to guarantee correctness of the counters. The event $e_c^1 = e_{push}^1 ;_L e_{pop}^1$ occurs iff the number of unconsumed instances of $e_{push}^1$ was $> 0$ when $e_{pop}^1$ occurred. Similarly for $e_c^2 = e_{push}^2 ;_L e_{pop}^2$.

Each point of the history corresponds to the machine being in a certain state. 

^badl — ^any |1‘^ (^Si bl I— I ... U CSn ) 

ebadi = eSi II es^ {^i,j G {1, ■ ■ . j) 

Let $N$ be the set of all instruction numbers, and let $I_1$ ($I_2$) be the set of all instruction numbers where counter one (two) is increased. Then for each $i \in I_1$, $j \in (N \setminus I_1)$, $a \in I_2$, $b \in (N \setminus I_2)$, we define

ebadi = {es, 11-^ U (es- || 

ebad2 {eSa. 11'^ || 

Let $D_1$ be the set of all instruction numbers where counter one is tested by an if-statement and $D_2$ be the set of all instruction numbers where counter two is tested by an if-statement. Each such test is simulated by an $e_{pop}^1$ or $e_{pop}^2$ event occurring. So, for each $i \in D_1$, $j \in (N \setminus D_1)$, $a \in D_2$, $b \in (N \setminus D_2)$ we define

^badl {^Si 11^ ^pop) II ^pop) 

^bad2 {^Sa 11^ ^pop) hi (^Sb || ^popl 

We need to check that the correct ‘gotos’ of the machine are followed. Firstly: 

^badl — ^ first ||^ ^Sq- 

We will use the abbreviation succ(ea) as shorthand for the next event immedia- 
tely following the occurrence of e^: succ{ea) = Gany For an instruction 

of the form Si : Ci = Ci + 1; goto Sj we define 

Gbadi = succ(eSi) ||~ es^ 

For an instruction of the form Si : if ci = 0 goto Sj else ci = ci — 1; goto Sk 

Sbadi = succ{eSi ||~ ej) ||-^ 

Gbadi = succ{eSi II ej) |H es^. 

Lastly, we ensure that if an ebadi or ebad 2 occurs at some point in the history, 
then it will also occur if and when the halt instruction is reached. 

GBADI = Gbadl Gs„ 

GBAD2 = Gbad2 eg„ 

We can now define the query $q$ by

$q = (e_{S_n} \cap \neg e_{BAD1} \cap \neg e_{BAD2}) = (q_1 \cap \neg q_2)$

where $q_1 = e_{S_n} \cap \neg e_{BAD1}$ and $q_2 = e_{BAD2}$. Observe that $q_1$ doesn’t depend on the event $e_c^2$ and $q_2$ doesn’t depend on the event $e_c^1$. Thus, they each use exactly one instance of the LIFO operator. □



Corollary 2. Let $q_1$ be an event query in the event algebra $\mathcal{E}_2$ using two instances of either $;_F$ or $;_L$. Let $q_2$ be an event query written in $\mathcal{E}_1$. Then satisfiability of $q_1$ is undecidable and the problem of whether $q_1 \Rightarrow q_2$ is undecidable.
5 Summary and Future Work

We have defined a simple core event language and a formal mechanism for specifying event consumption policies. We identified a class of policies for which equivalence and implication are in PSPACE. For more elaborate policies, we presented a class of event queries for which equivalence was decidable, provided each query used the policy only once. We then showed that testing implication for this class is undecidable. In our future work, we plan to investigate a) the use of our results for understanding theories of natural numbers in fragments of first-order logic with order (e.g., the guarded fragment); b) the use of temporal logics such as CTL for expressing properties of interest in event languages and for active databases generally.
Abstract. We consider transactions running on a database that consists
of records with unique totally-ordered keys and is organized as a sparse 
primary search tree such as a B-tree index on disk storage. We extend the 
classical read-write model of transactions by considering inserts, deletes 
and key-range scans and by distinguishing between four types of tran- 
saction states: forward-rolling, committed, backward-rolling, and rolled- 
back transactions. A search-tree transaction is modelled as a two-level 
transaction containing structure modifications as open nested subtran- 
sactions that can commit even though the parent transaction aborts. 
Isolation conditions are defined for search-tree transactions with nested 
structure modifications that guarantee the structural consistency of the 
search tree, a required isolation level (including phantom prevention) for 
database operations, and recoverability for structure modifications and 
database operations. 



1 Introduction 

The classical theory of concurrency control and recovery [4] defines transactions 
as strings of abstract read and write operations. This model is inadequate for de- 
scribing transactions on index structures in which data items are identified with 
ordered keys and in which inserts, deletes, and key-range scans are important 
operations. Most importantly, the interplay of the logical database operations and the physical, page-level actions is not discussed in the theoretical database literature in enough detail to allow a rigorous analysis of practical database algorithms. This is in contrast to the fact that key-range locking, page latching and physiological logging of record-level operations on pages are standard in industrial-strength database systems [5,6,12,13,14,15].

Also the actions done by an aborted transaction in its backward-rolling 
phase, or the actions needed at restart recovery when both forward-rolling and 
backward-rolling transactions are present, are not adequately treated in the for- 
mal transaction models. In [1,17,19] general recoverability results are derived 
for extended transaction models by treating an aborted transaction as a string 
$\alpha\alpha^{-1}$, where the prefix $\alpha$ consists of reads and writes (forming the forward-rolling phase), and the suffix $\alpha^{-1}$ is the reversed string of undos of the writes in $\alpha$ (forming the backward-rolling phase), thus capturing the idea of logical
undos of database operations. In this paper we present a transaction model that 
makes explicit the notions of forward-rolling, committed, backward-rolling, and 
rolled-back transactions. Our model allows for recovery aspects to be presented 
in a unified manner, covering aborts of single transactions during normal pro- 
cessing, as well as the actions needed at restart recovery after a system failure 
for a transaction in any of the above four states. 

In our model, the logical database is assumed to consist of records with uni- 
que keys taken from a totally ordered domain. The database operations include 
update operations of the forms “insert a record with key x” and “delete the 
record with key $x$”, and retrieval operations of the form “retrieve the record with the least key $x \geq a$ (or $x > a$)”, which allow a key range to be scanned
in ascending key order, simulating an SQL cursor. A database transaction can 
contain any number of these operations. Isolation aspects are discussed following 
the approach of [3], where different isolation anomalies are defined and analy- 
zed. The definitions of dirty writes, dirty reads, unrepeatable (or fuzzy) reads, 
and phantoms are adapted to the model. Isolation anomalies are prevented by 
key-range locking [5] (also called key-value locking in [12,13]).

We assume that our database is organized physically as a primary search-tree 
structure such as a B-tree [2] or a B-link-tree [7,16] whose nodes are disk pages. 
The non-leaf pages (or index pages) store router information (index terms), while 
the leaf pages (or data pages) store database items (data terms). Algorithms on 
search trees can achieve more concurrency than would be possible if strict conflict 
serializability were required for accesses of index terms as well as for accesses 
of data terms. This is because an internal database algorithm can exploit the 
fact that many valid search-tree states represent one database state [2,7,16,18]. 
Accordingly, all index-management algorithms in industrial-strength database 
management systems allow nonserializable histories as regards read and write 
accesses on index pages. A shared latch held on an index page when searching for 
a key is released as soon as the next page in the search path has been latched, 
and exclusive latches held on pages involved in a structure modification such 
as a page split or merge are released as soon as the structure modification is 
complete [5,9,10,11,12,13,14,15]. 

Following [9], we model a search-tree transaction as a two-level transac- 
tion that in its higher or database level contains database operations, that is, 
data-term retrievals, inserts and deletes on leaf pages of the tree, together with 
index-term retrievals needed to locate the leaf pages. In its lower or structure- 
modification level, all update operations that change the structure of the tree 
are grouped into one or more lower-level or structure-modification transactions. 
When the process that is generating a higher-level transaction finds that a change 
(such as a page split) in the structure of the tree is needed, it starts executing a 
lower-level transaction to do the structure modification, and when that is done 
(committed), the execution of the higher-level transaction is continued. It is 
the responsibility of the designer of the search-tree algorithms to determine the
boundaries of the lower-level transactions and to guarantee that each such tran- 
saction, when run alone in the absence of failures on a structurally consistent 
search tree, always produces a structurally consistent search tree as a result. 
The commit of a structure-modification subtransaction is independent of the 
outcome of the parent search-tree transaction, so that once a subtransaction has 
committed it will not be undone even if its parent aborts. On the other hand, 
the abort of a subtransaction will also imply the abort of the parent. Thus, the 
model is a variant of the “open nested” transaction model, see e.g. [5,20]. 

We derive isolation conditions for search-tree operations and structure mo- 
difications that guarantee that a history of search-tree transactions with nested 
structure-modification transactions preserves the consistency of both the logical 
and physical database. We also show that, given a history $H$ of database transactions on a database $D$ where the transactions are sufficiently isolated and $H$ can be run on $D$, then for any structurally consistent search tree $B$ that represents $D$ there exists a history $H'$ of isolated search-tree transactions with nested structure-modification transactions on $B$ that implements $H$, that is, maps the logical database operations (key inserts, deletes and retrievals) into physiological operations (data-term inserts, deletes and retrievals) on leaf pages of the search tree such that the tree produced by running $H'$ on $B$ is structurally consistent and represents the database produced by running $H$ on $D$.

The isolation conditions can be enforced by standard key-range locking and 
page latching protocols, and they also guarantee a recoverability property that 
allows an ARIES-based [14] algorithm to be used for restart recovery. For struc- 
ture modifications, both redo and undo recovery is physiological [5], that is, 
page-oriented. For database updates, redo recovery is physiological, but undo 
recovery can also be logical, so that the undo of a key insert involves a traversal 
down the tree in order to locate the page in which the key currently resides, un- 
less the key is still found in the same page into which it was originally inserted, 
in which case the update can be undone physiologically [9,12,13,15]. 

2 Database Operations and Transactions 

We assume that our database $D$ consists of database records of the form $(x, v)$, where $x$ is the key of the record and $v$ is the value of the record. Keys are unique, and there is a total order, $<$, among the keys. The least key is denoted by $-\infty$ and the greatest key is denoted by $\infty$. We assume that the keys $-\infty$ and $\infty$ do not appear in database records; they can only appear in search-tree index terms.

In normal transaction processing, a database transaction can be in one of the 
following four states: forward-rolling, committed, backward-rolling, or rolled-back. A forward-rolling transaction is a string of the form $b\alpha$, where $b$ denotes the begin operation and $\alpha$ is a string of database operations. In this paper we consider the following set of database operations (cf. [12,13,15]):

1) $r[x, \theta z, v]$: retrieve the first matching record $(x, v)$. Given a key $z$, find the least key $x$ and the associated value $v$ such that $x \,\theta\, z$ and the record $(x, v)$ is in the database. Here $\theta$ is one of the comparison operators “$\geq$” or “$>$”. We also say that the operation reads the key range $[z, x]$ (if $\theta$ is “$\geq$”) or $(z, x]$ (if $\theta$ is “$>$”).

2) $n[x, v]$: insert a new record $(x, v)$. Given a key $x$ and a value $v$ such that $x$ does not appear in the database, insert the record $(x, v)$ into the database.

3) $d[x, v]$: delete the record with key $x$. Given a key $x$ in the database, delete the record, $(x, v)$, with key $x$ from the database.
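
The retrieval operation and the cursor-style scan it enables can be sketched in Python as follows. This is only an illustration under stated assumptions: the database is modelled as a `dict` from integer keys to values, the comparison operator is passed as the string `">="` or `">"`, and the names `retrieve` and `scan` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Optional, Tuple


def retrieve(db: Dict[int, str], z: int, theta: str) -> Optional[Tuple[int, str]]:
    """Return the record (x, v) with the least key x such that x theta z."""
    keys = sorted(db)
    if theta == ">=":
        i = bisect_left(keys, z)   # first key not smaller than z
    elif theta == ">":
        i = bisect_right(keys, z)  # first key strictly greater than z
    else:
        raise ValueError("theta must be '>=' or '>'")
    if i == len(keys):
        return None  # no matching record in the database
    return keys[i], db[keys[i]]


def scan(db: Dict[int, str], low: int, high: int) -> List[Tuple[int, str]]:
    """Scan the key range [low, high] in ascending key order."""
    result = []
    hit = retrieve(db, low, ">=")
    while hit is not None and hit[0] <= high:
        result.append(hit)
        hit = retrieve(db, hit[0], ">")  # continue after the last key seen
    return result
```

The scan starts with one “$\geq$” retrieval and then repeats “$>$” retrievals from the last key returned, which is how a sequence of retrievals walks a key range.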

A committed transaction is of the form $b\alpha c$, where $b\alpha$ is a forward-rolling transaction and $c$ denotes the commit operation. An aborted transaction is one that contains the abort operation, $a$. A backward-rolling transaction is an aborted transaction of the form $b\alpha\beta a\beta^{-1}$, where $b\alpha\beta$ is a forward-rolling transaction and $\beta^{-1}$ is the inverse of $\beta$ (defined below). The string $\alpha\beta$ is called the forward-rolling phase, and the string $\beta^{-1}$ the backward-rolling phase, of the transaction.

The inverse $\beta^{-1}$ of an operation string $\beta$ is defined inductively as follows. For the empty operation string, $\epsilon$, the inverse $\epsilon^{-1}$ is defined as $\epsilon$. The inverse $(\beta o)^{-1}$ of a non-empty operation string $\beta o$, where $o$ is a single operation, is defined as $o^{-1}\beta^{-1}$, where $o^{-1}$ denotes the inverse of operation $o$. The inverses for our set of database operations are defined by (cf. [1,17,19]): (1) $r^{-1}[x, \theta z, v] = \epsilon$; (2) $n^{-1}[x, v] = d[x, v]$; (3) $d^{-1}[x, v] = n[x, v]$.
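
The inductive definition amounts to reversing the string and inverting each operation. A small Python sketch, assuming operations are encoded as tuples `("r", x, theta, z, v)`, `("n", x, v)` and `("d", x, v)` (an encoding chosen here for illustration, as are the function names):

```python
from typing import List, Tuple

Op = Tuple  # ("r", x, theta, z, v), ("n", x, v) or ("d", x, v)


def inverse_op(op: Op) -> List[Op]:
    """Inverse of a single operation, as a (possibly empty) operation string."""
    kind = op[0]
    if kind == "r":
        return []                      # a retrieval has the empty inverse
    if kind == "n":
        return [("d",) + op[1:]]       # an insert is undone by a delete
    if kind == "d":
        return [("n",) + op[1:]]       # a delete is undone by an insert
    raise ValueError(f"unknown operation kind: {kind!r}")


def inverse_string(ops: List[Op]) -> List[Op]:
    """Inverse of an operation string: the last operation is undone first."""
    result: List[Op] = []
    for op in reversed(ops):
        result.extend(inverse_op(op))
    return result
```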

A backward-rolling transaction $b\alpha\beta a\beta^{-1}$ thus denotes an aborted transaction that has undone a suffix, $\beta$, of its forward-rolling phase, while the prefix, $\alpha$, is still not undone. An aborted transaction of the form $b\alpha a\alpha^{-1}c$ is a rolled-back transaction or an aborted transaction that has completed its rollback. Thus we use the operation name $c$ to denote the completion of the rollback of an aborted transaction as well as the commit of a committed transaction. A forward-rolling or a backward-rolling transaction is called an active transaction.

For example, $b\,r[x, \geq z, v]\,d[x, v]\,n[x, v + t]\,a\,d[x, v + t]\,n[x, v]\,c$, where $z$ and $t$ are constants and $x$ and $v$ are free variables, is a rolled-back aborted transaction which in its forward-rolling phase retrieves the record, $(x, v)$, with the least key $x \geq z$ (thus reading the key range $[z, x]$) and increases the value for that record by $t$ and which in its backward-rolling phase undoes its updates.

An operation $o$, an operation string $\alpha$, or a transaction $T$, is ground, if it contains no free variables. A ground operation string represents a specific execution of the operation string. A history for a set of ground transactions is a string $H$ in the shuffle of those transactions. $H$ is a complete history if all its transactions are committed or rolled-back.

For a forward-rolling transaction $b\alpha$, the string $a\alpha^{-1}c$ is the completion string, and $b\alpha a\alpha^{-1}c$ the completed transaction. For the backward-rolling transaction $b\alpha\beta a\beta^{-1}$, the string $\alpha^{-1}c$ is the completion string, and $b\alpha\beta a\beta^{-1}\alpha^{-1}c$ the completed transaction. A completion string $\gamma$ for an incomplete history $H$ is any string in the shuffle of the completion strings of all the active transactions in $H$; the complete history $H\gamma$ is a completed history for $H$.

Let $D$ be a database. We define when a ground operation or operation string can be run on $D$ and what is the database produced. Any retrieval operation $r[x, \theta z, v]$ can always be run on $D$, and so can the operations $b$, $c$, and $a$; the database produced is $D$. If $D$ does not contain a record with key $x$, an insert operation $n[x, v]$ can be run on $D$ and produces $D \cup \{(x, v)\}$. If $D$ contains $(x, v)$, the delete operation $d[x, v]$ can be run on $D$ and produces $D \setminus \{(x, v)\}$. The empty string $\epsilon$ can be run on $D$ and produces $D$. If an operation string $\alpha$ can be run on $D$ and produces $D'$ and an operation $o$ can be run on $D'$ and produces $D''$, then the string $\alpha o$ can be run on $D$ and produces $D''$.
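
The definition above translates almost directly into an interpreter. The Python sketch below assumes the database is a `dict` from keys to values and that ground operations are tuples such as `("b",)`, `("r", x, theta, z, v)`, `("n", x, v)` and `("d", x, v)`; both the encoding and the name `run` are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def run(db: Dict[int, int], ops: List[Tuple]) -> Dict[int, int]:
    """Run a ground operation string on db and return the database produced.

    Raises ValueError if some operation cannot be run.
    """
    current = dict(db)  # leave the caller's database untouched
    for op in ops:
        kind = op[0]
        if kind in ("b", "c", "a", "r"):
            continue  # begin, commit, abort and retrievals leave the database as is
        if kind == "n":
            _, x, v = op
            if x in current:
                raise ValueError(f"cannot insert: key {x} already present")
            current[x] = v
        elif kind == "d":
            _, x, v = op
            if x not in current or current[x] != v:
                raise ValueError(f"cannot delete: record ({x}, {v}) not present")
            del current[x]
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return current
```

Running the rolled-back example transaction from this section (with concrete values substituted) returns the database it started from, as expected of a completed rollback.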



3 Isolation of Database Transactions 

In this section we give simple isolation conditions that can be enforced by stan- 
dard key-range locking and that guarantee both execution correctness (seria- 
lizability) and recoverability of transaction histories. In [3], the SQL isolation 
levels are analyzed and different isolation anomalies, such as dirty writes, dirty 
reads, unrepeatable (or fuzzy) reads, and phantoms, are defined for a non-serial 
history of transactions. For our model, we redefine dirty writes, dirty reads and 
unrepeatable reads so as to encompass also different types of phantoms. 

Let $H = \alpha\beta$ be a history and $T$ a transaction of $H$ such that the begin-transaction operation $b$ of $T$ is contained in $\alpha$. Let $T'$ be the prefix of $T$ contained in $\alpha$. We say that $T$ is forward-rolling, committed, backward-rolling, rolled-back, or active, in $\alpha$, if $T'$ is forward-rolling, committed, backward-rolling, rolled-back, or active, respectively.

A key $x$ inserted or deleted in $\alpha$ has an uncommitted update by $T$ in $\alpha$ if one of the following three statements holds: (1) $T$ is forward-rolling in $\alpha$ and the last update (i.e., insert or delete) on $x$ in $\alpha$ is by $T$; (2) $T$ is backward-rolling in $\alpha$ and the last update on $x$ in $\alpha$ is by the forward-rolling phase of $T$ (which update thus has not yet been undone); (3) $T$ is backward-rolling in $\alpha$ and the last update on $x$ in $\alpha$ is an inverse operation $o^{-1}[x]$ by $T$ where the corresponding forward operation $o[x]$ by $T$ is not the first update on $x$ by $T$ (so that there remain updates on $x$ by $T$ that have not yet been undone). Once the first update on $x$ in the forward-rolling phase of $T$ has been undone, we regard $x$ as committed, because in our transaction model the inverse operations done in the backward-rolling phase are never undone [12,13,14].

Let $o_i$ be a database operation, $\alpha$, $\beta$ and $\gamma$ operation strings, and $H = \alpha o_i \beta\gamma$ a history containing any number of forward-rolling, committed, backward-rolling, and rolled-back transactions. Further let $T_i$ and $T_j$ be two distinct transactions of $H$.

1) $o_i$ is a dirty write on key $x$ by $T_i$ in $H$ if $o_i$ is an update by $T_i$ on $x$, where $x$ has an uncommitted update by $T_j$ in $\alpha$.

2) $o_i$ is a dirty read by $T_i$ in $H$ if $o_i$ is a retrieval operation that reads a key range containing some key $x$ that has an uncommitted update by $T_j$ in $\alpha$.

3) $o_i$ is an unrepeatable read by $T_i$ in $H$ if $o_i$ is a retrieval operation that reads a key range containing some key $x$, $T_i$ is forward-rolling in $\alpha o_i \beta$, and the first operation in $\gamma$ is an update on $x$ by $T_j$. Note that $o_i$ is not considered unrepeatable in the case that $T_i$ is backward-rolling in $\alpha o_i \beta$, because $T_i$ will eventually undo all its updates, including those possibly based on the retrieved record.

When given an incomplete history H describing the state of transaction pro- 
cessing at the moment of a system crash, we should be able to roll back all 
the forward-rolling transactions and complete the rollback of all the backward- 
rolling transactions in H. In other words, we should find a completed history 
that can be run on every database on which H can be run. Such a history does 
not necessarily exist at all if H is non-strict [4], that is, contains dirty writes by 
committed transactions. On the other hand, if H contains no dirty writes at all, 
then for any completion string $\gamma$ of $H$, the completed history $H\gamma$ contains no dirty writes and can be run on every database on which $H$ can be run. Thus, each active transaction in $H$ can be rolled back independently of other active transactions. It is also easy to see that if $H$ contains no dirty writes, dirty reads or unrepeatable reads, then any completed history $H\gamma$ is also free of those anomalies and is conflict-equivalent to some serial history of the transactions of $H\gamma$ and equivalent to (i.e., produces the same database as) some serial history
of the committed transactions of H. Such a history H is also prefix-reducible, 
forward-safe and backward-safe [1,17,19]. 

Using the standard key-range locking protocol [5,12,13], all of the above an- 
omalies can be avoided. In this protocol, transactions acquire in their forward- 
rolling phase commit-duration X-locks on inserted keys and on the next keys 
of deleted keys, short-duration X-locks on deleted keys and on the next keys 
of inserted keys, and commit-duration S-locks on retrieved keys. Short-duration 
locks are held only for the time the operation is being performed. Commit- 
duration X-locks are released after the transaction has committed or completed 
its rollback. Commit-duration S-locks can be released after the transaction has 
committed or aborted. No additional locks are acquired for operations done in 
the backward-rolling phase of aborted transactions: an inverse operation $o^{-1}[x]$ is performed under the protection of the commit-duration X-lock acquired for the corresponding forward operation $o[x]$. Thus we have:

Lemma 1. If a history $H$ can be run on a database $D$ under the key-range locking protocol, then for any completion string $\gamma$ for $H$, the completed history $H\gamma$ can be run on $D$ under the key-range locking protocol. □
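
The lock requests that the key-range locking protocol described above attaches to each forward-rolling operation can be tabulated with a short Python sketch. The tuple encoding of operations, the function names, and the use of `math.inf` as the next key when no larger key exists are assumptions made for illustration; lock compatibility and waiting are not modelled.

```python
import math
from bisect import bisect_right
from typing import Iterable, List, Tuple


def next_key(keys: Iterable[float], x: float) -> float:
    """Least key greater than x, or infinity if there is none (an assumption)."""
    ordered = sorted(keys)
    i = bisect_right(ordered, x)
    return ordered[i] if i < len(ordered) else math.inf


def lock_requests(op: Tuple, keys: Iterable[float]) -> List[Tuple[float, str, str]]:
    """Return (key, mode, duration) triples requested for a forward operation.

    op is ("n", x, v), ("d", x, v) or ("r", x, theta, z, v); keys are the keys
    currently in the database.
    """
    kind, x = op[0], op[1]
    if kind == "n":
        # Commit-duration X-lock on the inserted key, short X-lock on its next key.
        return [(x, "X", "commit"), (next_key(keys, x), "X", "short")]
    if kind == "d":
        # Short X-lock on the deleted key, commit-duration X-lock on its next key.
        return [(x, "X", "short"), (next_key(keys, x), "X", "commit")]
    if kind == "r":
        # Commit-duration S-lock on the retrieved key.
        return [(x, "S", "commit")]
    raise ValueError(f"unknown operation kind: {kind!r}")
```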



4 Search-Tree Operations and Transactions

We assume that our search trees are similar to B-trees [2] or B-link-trees [7,16]. 
In a B-tree, each child node is directly accessible from its parent node, while in a 
B-link-tree some children (except the eldest) may only be accessed via a side link 
from the next elder sibling. In a B-tree, only the leaf nodes are sideways linked 
from left to right (to allow efficient key-range scans, cf. [12,13,15]), while in a 
B-link-tree there is a left-to-right side-link chain on every level of the tree. We 
assume the tree is used as a sparse index to the database, so that the database 
records are directly stored in the leaf nodes. 
Formally, a search tree is an array $B[0, \ldots, N]$ of disk pages $B[p]$ indexed by unique page numbers $p = 0, \ldots, N$. The page $B[p]$ with number $p$ is called page
p, for short. Each page (other than page 0) is labelled either allocated, in which 
case it is part of the tree structure, or deallocated otherwise. Page 0 is assumed 
to contain a storage map (a bit vector) that indicates which pages are allocated 
and which are deallocated. Page 1, the root, is always allocated. The allocated 
pages form a tree rooted at 1. 

Each page $p$ covers a half-open range of keys $x$, low-key($p$) $\leq x <$ high-key($p$). For each level of the tree, the sequence $p_1, \ldots, p_n$ of pages on that level must form a partition of the key space $(-\infty, \infty)$. Accordingly, on each level, the low key of the first (or leftmost) page is $-\infty$, the high key of the last (or rightmost) page is $\infty$, and high-key($p_i$) = low-key($p_{i+1}$), for $i = 1, \ldots, n - 1$.
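
The partition condition for one level can be checked mechanically. In the Python sketch below each page is reduced to its (low-key, high-key) pair, `math.inf` stands in for $\infty$, and each page is assumed to cover its low key but not its high key; the function names are hypothetical.

```python
import math
from typing import List, Tuple

KeyRange = Tuple[float, float]  # (low-key, high-key) of one page


def is_level_partition(pages: List[KeyRange]) -> bool:
    """Check that the pages of one level, left to right, partition the key space."""
    if not pages:
        return False
    if pages[0][0] != -math.inf or pages[-1][1] != math.inf:
        return False
    if any(low >= high for low, high in pages):
        return False  # every page must cover a non-empty range
    # Adjacent pages must meet exactly: high-key(p_i) = low-key(p_{i+1}).
    return all(pages[i][1] == pages[i + 1][0] for i in range(len(pages) - 1))


def covering_page(pages: List[KeyRange], x: float) -> int:
    """Index of the page whose half-open range contains key x."""
    for index, (low, high) in enumerate(pages):
        if low <= x < high:
            return index
    raise ValueError(f"no page covers key {x}")
```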

An allocated page is an index page or a data page. An index page $p$ is a non-leaf page and it contains index terms of the form $(x, q)$, where $x$ is a key, $q$ is a page number, low-key($p$) $\leq x \leq$ high-key($p$), and (if $x < \infty$) $x$ = low-key($q$). Index terms $(x, q)$ with $x <$ high-key($p$) are child terms, and the term $(x, q)$ with $x$ = high-key($p$) (if present) is the high-key term. The high-key term is present in a B-link-tree. The page number $q$ in the high-key term is that of next($p$), the page next to $p$ on the same level (when $x < \infty$). A data page $p$ is a leaf page and always contains the high-key term and a set of data terms $(x, v)$ with low-key($p$) $\leq x <$ high-key($p$). The set of data terms in the data pages of a search tree $B$ is called the database represented by $B$ and denoted by db($B$).

A search tree is structurally consistent if it satisfies the above definition and 
any additional conditions required by the specific tree type in question, such as 
B-tree balance conditions (minimum fill factor for pages, etc). 

A search-tree transaction can contain the following search-tree operations. 

1) $r[p, x, \theta z, v]$: given a data page $p$ and a key $z \geq$ low-key($p$), retrieve from $p$ the term $(x, v)$ with the least key $x \,\theta\, z$ (if $z <$ high-key($p$)) or the high-key term $(x, v)$ (if $z \geq$ high-key($p$)). When the least key $x \,\theta\, z$ is not greater than the key of the last data term in $p$, the operation corresponds to database transaction operation $r[x, \theta z, v]$; the difference is that the search-tree operation requires a page $p$ onto which the operation is applied. The operation is then called a data-term retrieval on page $p$. Otherwise, the operation retrieves the high-key term of $p$, and the operation is then called an index-term retrieval on $p$.

2) $r[p, x, \theta z, q]$: given an index page $p$ and a key $z \geq$ low-key($p$), retrieve from $p$ the index term $(x, q)$ with the least key $x \,\theta\, z$ (if $z <$ high-key($p$)) or the high-key term $(x, q)$ (if $z \geq$ high-key($p$)). The operation is used to traverse from one page to another in the search tree. The operation is called an index-term retrieval on $p$.

3) $n[p, x, v]$: given a data page $p$, a key $x$ and a value $v$ such that low-key($p$) $\leq x <$ high-key($p$) and $p$ does not contain a data term with key $x$ and $p$ can accommodate $(x, v)$, insert the data term $(x, v)$ into $p$. The operation corresponds to database transaction operation $n[x, v]$.

4) $d[p, x, v]$: given a data page $p$ and a key $x$ such that $p$ contains a data term, $(x, v)$, with key $x$, and $p$ will not underflow if $(x, v)$ is deleted from $p$, delete $(x, v)$ from $p$. The operation corresponds to database transaction operation $d[x, v]$. We say that $n[p, x, v]$ and $d[p, x, v]$ are data-term updates on $p$.

Operations 1 to 4 above do not change the structure of the tree. The struc- 
ture can be changed by operations 5 to 9 below, which are called structure 
modifications (or sm’s, for short). 

5) $n[p, t]$: allocate a new page $p$ of type $t = i$ (for index page) or $t = d$ (for data page).

6) $d[p, t]$: deallocate an empty page $p$. Operations 5 and 6 also update the storage map page 0.

7) $n[p, x, q]$: insert a new index term $(x, q)$ into an index page $p$ that can accommodate $(x, q)$.

8) $d[p, x, q]$: delete an index term $(x, q)$ from an index page $p$. Operations 7 and 8 are called index-term updates on $p$.

9) $m[p, x, v, p']$: move an index or data term $(x, v)$ from page $p$ to a non-full page $p'$. We say that $n[p, t]$ and $d[p, t]$ are sm’s on pages 0 and $p$ and that $n[p, x, q]$ and $d[p, x, q]$ are sm’s on $p$ and that $m[p, x, v, p']$ is an sm on $p$ and $p'$.

For example, a page split in a B-link-tree could be implemented by sm’s as follows. Assume that page $p$ is found to be full. First, $n[p', t]$ is used to allocate a new page $p'$. Then $m[p, x, v, p']$ is used to move the upper half of the terms (including the high-key term) from page $p$ to page $p'$, moving one term $(x, v)$ at a time. Finally a new high-key term $(x', p')$ is inserted into $p$ by the operation $n[p, x', p']$, where $x'$ is the least key in $p'$.
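
The split just described can be written out as the sequence of sm’s it generates. The Python sketch below only produces that sequence and does not modify any tree. The tuple encoding, the name `split_operations`, and the choice of the midpoint `len(terms) // 2` as the start of the “upper half” are assumptions for illustration.

```python
from typing import List, Tuple


def split_operations(p: int, new_page: int, page_type: str,
                     terms: List[Tuple]) -> List[Tuple]:
    """Return the sm's that split page p into p and new_page.

    terms is the key-ordered list of (key, payload) terms on p, with the
    high-key term last.
    """
    if len(terms) < 2:
        raise ValueError("a page needs at least two terms to be split")
    moved = terms[len(terms) // 2:]          # upper half, including the high-key term
    ops: List[Tuple] = [("n", new_page, page_type)]          # allocate p'
    ops += [("m", p, x, v, new_page) for x, v in moved]      # move terms one at a time
    least_moved_key = moved[0][0]                            # least key now in p'
    ops.append(("n", p, least_moved_key, new_page))          # new high-key term of p
    return ops
```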

In a natural way we can define when a ground index-term retrieval, a data- 
term retrieval, data-term update, an sm, or any string of such operations can be 
run on a given search tree B and what is the search tree thus produced. For an 
operation to be runnable, all the stated preconditions for that operation must 
be satisfied. For example, a ground operation d[p, t] can be run on B if and only 
if page p is allocated, empty, and of type t. The search tree produced is obtained 
from B by marking p as deallocated in the storage-map page 0. Similarly, a 
ground operation $n[p, x, v]$ can be run on $B$ only if $p$ is a data page with $y \leq x <$ high-key($p$) and some index page of $B$ contains the index term $(y, p)$ (indicating that $y$ = low-key($p$)). Then if $p$ has room for $(x, v)$ and does not yet contain a term with key $x$, $(x, v)$ can be inserted into $p$.

All data-term retrievals and data-term updates have projections onto their corresponding database operations: $\pi(r[p, x, \theta z, v]) = r[x, \theta z, v]$; $\pi(n[p, x, v]) = n[x, v]$; $\pi(d[p, x, v]) = d[x, v]$. For index-term retrievals and for all sm’s $o$ we set $\pi(o) = \epsilon$. The projection operation $\pi$ is extended in the natural way to operation strings $\alpha$: $\pi(\epsilon) = \epsilon$; $\pi(\alpha o) = \pi(\alpha)\pi(o)$.
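
A possible rendering of the projection in Python is shown below. Because the paper distinguishes operations such as $n[p, x, v]$ and $n[p, x, q]$ by the kind of their arguments, the sketch assumes explicit tags instead (`"r_data"`, `"n_data"`, `"d_data"` for data-term operations, and any other tag for index-term retrievals and sm’s); these tags and the function names are illustrative assumptions.

```python
from typing import List, Tuple

# Assumed tags for the data-term operations and the database operation each maps to.
DATA_OPS = {"r_data": "r", "n_data": "n", "d_data": "d"}


def project_op(op: Tuple) -> List[Tuple]:
    """Project one search-tree operation onto a database operation string."""
    kind = op[0]
    if kind in DATA_OPS:
        # Drop the page number (second component) and keep the other arguments.
        return [(DATA_OPS[kind],) + tuple(op[2:])]
    return []  # index-term retrievals and sm's project to the empty string


def project(ops: List[Tuple]) -> List[Tuple]:
    """Extend the projection to operation strings, operation by operation."""
    result: List[Tuple] = []
    for op in ops:
        result.extend(project_op(op))
    return result
```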

Lemma 2. Let $B$ be a structurally consistent search tree. If $o[p, x]$ is a ground data-term retrieval or update on data page $p$ of $B$ that can be run on $B$, then $\pi(o[p, x]) = o[x]$ can be run on db($B$). □

For the data-term updates and sm’s, physiological inverses are defined:
n⁻¹[p, x, v] = d[p, x, v]; d⁻¹[p, x, v] = n[p, x, v]; n⁻¹[p, t] = d[p, t];
d⁻¹[p, t] = n[p, t]; n⁻¹[p, x, q] = d[p, x, q]; d⁻¹[p, x, q] = n[p, x, q];
m⁻¹[p, x, v, p'] = m[p', x, v, p].
A physiological operation [5] is always applied to a given page, whose number
is given as an input argument of the operation. All sm’s are always undone
physiologically, while a data-term update may be undone either physiologically
or logically.
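
The inverse table translates directly into code. In this sketch an operation is a pair (name, args), which is our own encoding, and the inverse of a string is taken to be the inverses of its operations in reverse order; that ordering is the usual convention and is assumed here rather than quoted from the text.

```python
def inverse(op):
    """Physiological inverse of a data-term update or an sm."""
    name, args = op
    if name == "m":                      # m^-1[p, x, v, p'] = m[p', x, v, p]
        p, x, v, q = args
        return ("m", (q, x, v, p))
    if name == "n":                      # an insert or allocation is undone by d
        return ("d", args)
    if name == "d":                      # a delete or deallocation is undone by n
        return ("n", args)
    raise ValueError("no physiological inverse for operation " + repr(name))

def undo_string(ops):
    """Inverse of an operation string (assumed: inverses in reverse order)."""
    return [inverse(op) for op in reversed(ops)]

# The sm's of a page split of page 1 into new page 2.
split = [("n", (2, "data")), ("m", (1, 30, "c", 2)),
         ("m", (1, 40, "d", 2)), ("n", (1, 30, 2))]
undo = undo_string(split)
```

Applying inverse twice returns the original operation, which is what makes the undo of an undo well defined.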

The sm’s can only appear in a structure-modification transaction, or an sm 
transaction, for short, which is an open transaction nested in a search-tree tran- 
saction. An open transaction can commit independently from the outcome (com- 
mit or abort) of the parent transaction [5]. This property is essential in allowing 
sufficient concurrency between search-tree transactions that contain multiple 
data-term retrievals and updates. Once an sm transaction nested in a search- 
tree transaction T has been successfully completed, the X-latches on the modified 
pages can be released so that the pages can be accessed by other transactions 
while T is still active (and thus may later abort). However, if an sm transaction 
aborts, then the parent transaction will also be aborted once the sm transaction 
has been rolled back. 

An sm transaction may take one of the forms bα, bαc, bαβaβ⁻¹, or bαaα⁻¹c,
that is, forward-rolling, committed, backward-rolling, or rolled-back. Here the
forward-rolling operation strings α and β consist of sm’s, and α⁻¹ and β⁻¹
consist of their physiological inverses. A committed sm transaction T is correct
if, for any structurally consistent search tree on which T can be run, T produces a 
structurally consistent tree as a result. We assume that a correct sm transaction 
exists for handling any overflow or underflow situation that might appear. 

For example, a page split in a B-tree must be embedded in an sm transaction. 
When a page split propagates up the tree, due to parent pages in the search path 
that are full and therefore cannot accommodate the child terms for the new 
pages created by the splits, then, in a conventional B-tree, the entire series of 
page splits must be embedded into one sm transaction (a “nested top action” in 
[12,13,15]). This is necessary because a B-tree is not in a structurally consistent 
state in the middle of two splits in the series. On the contrary, a B-link-tree is 
kept consistent between any two splits done on neighboring levels of the tree, 
thus making possible the design of shorter sm transactions [9,10,11]. 

Every sm transaction is nested in some higher-level transaction, called a
search-tree transaction, which is the parent of the sm transaction. Let φ be any
string over index-term retrievals, data-term retrievals, data-term updates, and
committed sm transactions. A search-tree transaction T is forward-rolling if it is
a string of the form bφ or bφS, where S is a forward-rolling sm transaction, and
committed if it is of the form bφc. T is aborted during a structure modification
if it is of the form bφS, where S is a backward-rolling sm transaction. T is
backward-rolling if it is of the form bφ(ε|S₁)aδ(ε|S₂), where S₁ is a rolled-back
sm transaction, δ is a string over index-term retrievals, data-term updates, and
committed and rolled-back sm transactions, and S₂ is a backward-rolling sm
transaction, such that π(δ) = π(φ')⁻¹ for some suffix φ' of φ. (Here (α₁|α₂)
denotes α₁ or α₂.) T is rolled-back if it is of the form bφ(ε|S)aδc, where S is
a rolled-back sm transaction and δ is a string over index-term retrievals,
data-term updates, and committed and rolled-back sm transactions such that
π(δ) = π(φ)⁻¹. Note that a logical undo may cause sm transactions to be invoked, and
those sm transactions may have to be rolled back should a failure occur during
the rollback.

Completion strings and completed transactions are defined in analogy with 
database transactions. For an active sm transaction, the completion string is 
uniquely defined and consists of the physiological inverses of the sm’s, while for 
an active search-tree transaction we only require that the projection of a com- 
pletion string on the database operations must give the inverse of the projection 
of the not-yet-undone portion of the forward-rolling phase. 

Let D be a database and o[x] (with parameter list x) a ground retrieval or
update on some key x that can be run on D. Let B be a structurally consistent
search tree with db(B) = D and let p be a data page of B. We define that the
search-tree operation o[p, x] on B is a physiological implementation of o[x] on B if
the list of parameters p, x satisfies the preconditions stated above for a data-term
retrieval or update o and if o[p, x] can be run on B. Thus a data-term retrieval
r[p, x, θz, v] is a physiological implementation of r[x, θz, v] if low-key(p) < z, x <
high-key(p), and (x, v) is the data term in p with the least key x θ z. A data-term
insert n[p, x, v] is a physiological implementation of n[x, v] if low-key(p) < x <
high-key(p), p contains no data term with key x, and p has room for (x, v).
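
The two preconditions can be checked mechanically. The sketch below assumes a page is a dictionary with fields low, high, capacity and terms (all hypothetical names), and it treats the lower bound as inclusive, that is, low-key(p) may itself be a key stored in p; the extracted text does not settle whether that comparison is strict.

```python
import operator

THETA = {">": operator.gt, ">=": operator.ge}

def physiological_insert_ok(page, x):
    """Can n[p, x, v] serve as a physiological implementation of n[x, v]?"""
    return (page["low"] <= x < page["high"]             # p covers key x
            and x not in page["terms"]                  # no term with key x yet
            and len(page["terms"]) < page["capacity"])  # p has room

def physiological_retrieve(page, theta, z):
    """Return (x, v) with the least key x theta z in p, or None if p cannot answer."""
    if not page["low"] <= z:
        return None
    candidates = [k for k in page["terms"] if THETA[theta](k, z)]
    if not candidates:
        return None            # the answer, if any, lies on a later page
    x = min(candidates)
    return (x, page["terms"][x]) if x < page["high"] else None

page = {"low": 10, "high": 50, "capacity": 4,
        "terms": {10: "a", 20: "b", 40: "c"}}
```

A result of None means only that no physiological implementation exists on this page; the operation may still have a logical implementation that starts from the root.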

Physiological implementations of operations can be used, if possible, when
previous operations of the search-tree transaction provide good guesses of the
data page p on which the next operation will go. For example, in implementing
a string of retrievals r[x₁, >x₀, v₁], r[x₂, >x₁, v₂], the two data terms to be
retrieved most probably reside in the same page p. When p has been
located by a root-to-leaf traversal in search of the key x₀, (x₁, v₁) has been
retrieved from p, and p has been unlatched, the page number of p is remembered.
A physiological implementation of the latter retrieval is then tried by relatching
p and (if this succeeds) examining the contents of p to find out whether p is
indeed the page on which the retrieval should be run.

Clearly, a physiological implementation does not always exist. A data page p
may be too full to accommodate an insert or too underfull to allow a delete. In
the case of a retrieval r[x, θz, v] the data page p that covers the key z may not
be the one that holds (x, v), so that the side link to the next page p' must be
followed, and the search-tree operation used to retrieve (x, v) will take the form
r[p', x, ≥z', v], for z' = low-key(p') > z.

The definition of a structurally consistent search tree B with db(B) = D
implies that for any retrieval r[x, θz, v] that can be run on D there is a search-
tree operation string of the form

γ = r[p₁, x₁, θ₁z₁, p₂] r[p₂, x₂, θ₂z₂, p₃] ... r[pₙ, xₙ, θₙzₙ, pₙ₊₁] r[pₙ₊₁, x, θ'z', v]

that can be run on B. Here p₁ is the root of the tree, r[pᵢ, xᵢ, θᵢzᵢ, pᵢ₊₁] is an
index-term retrieval, i = 1, ..., n (n > 0), and r[pₙ₊₁, x, θ'z', v] is a data-term
retrieval. Also, if z = z' then θ = θ'; else z < z', θ' is “≥” and db(B) contains
no key y with z < y < z'. We call γ a logical implementation of r[x, θz, v] on B.
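
A logical implementation of a retrieval can be sketched as a root-to-leaf traversal that falls back on the side link. The tree layout is an assumption of the sketch: index pages hold (low key, child) terms, data pages hold a low key, a dictionary of data terms and a next pointer, and each index-term retrieval picks the term with the greatest low key not exceeding z. The function name logical_retrieve is hypothetical.

```python
import operator

THETA = {">": operator.gt, ">=": operator.ge}
NEG_INF = float("-inf")

def logical_retrieve(tree, root, theta, z):
    """Return ((x, v), trace) for r[x, theta z, v]; trace lists the page-level retrievals."""
    trace = []
    p = root
    while tree[p]["type"] == "index":
        covering = [t for t in tree[p]["terms"] if t[0] <= z]
        low, child = max(covering, key=lambda t: t[0])
        trace.append(("r", p, low, "<=", z, child))      # index-term retrieval
        p = child
    while p is not None:
        page = tree[p]
        keys = [k for k in page["terms"] if THETA[theta](k, z)]
        if keys:
            x = min(keys)
            trace.append(("r", p, x, theta, z, page["terms"][x]))  # data-term retrieval
            return (x, page["terms"][x]), trace
        p = page["next"]                                 # follow the side link to p'
        if p is not None:
            theta, z = ">=", tree[p]["low"]              # retry as r[p', x, >= z', v]
    return None, trace

tree = {
    0: {"type": "index", "terms": [(NEG_INF, 1), (30, 2)]},
    1: {"type": "data", "low": NEG_INF, "terms": {10: "a", 20: "b"}, "next": 2},
    2: {"type": "data", "low": 30, "terms": {30: "c", 40: "d"}, "next": None},
}
result, trace = logical_retrieve(tree, 0, ">", 20)
```

In the example, page 1 covers key 20 but holds no larger key, so the final data-term retrieval runs on page 2 with the comparison changed to “≥” against low-key(p').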

For an update o[x, v], o ∈ {n, d}, that can be run on D = db(B), there exists
a logical implementation on B of the form αo[p, x, v]β that can be run on B. Here
p is a data page of B, and α and β consist of index-term retrievals and correct
committed sm transactions. Note that this general form captures strategies in
which page splits or merges are performed bottom-up, as well as strategies in
which they are performed top-down.

For a data-term update o[p, x, v], o ∈ {n, d}, a logical inverse is any logical
implementation αo⁻¹[p', x, v]β of o⁻¹[x, v] (= π(o[p, x, v])⁻¹). Note that a record
(x,v) inserted into page p may have moved to page p' by the time the insert is 
undone. Similarly, the page that covers key x at the time a delete of (x, v) is 
undone may be different from the page the record was deleted from. 

5 Isolation of Search-Tree Transactions

Let Ti and Tj be two search-tree or sm transactions that potentially can run 
concurrently in a history H, that is, neither is the parent of the other and they 
are not both sm transactions of the same parent. In the case that both Ti and Tj 
are search-tree transactions we define dirty writes, dirty reads and unrepeatable 
reads in the same way as for database transactions taking into account only 
retrievals and updates of data terms, so that there is an anomaly between Ti 
and Tj in H if and only if the same anomaly exists between π(Ti) and π(Tj) in
π(H). Since search-tree transactions cannot update index pages, no anomalies
are defined between operations on index pages by two search-tree transactions. 

In the case that one or both of Ti and Tj are sm transactions we have to 
consider anomalies arising from conflicting accesses to index or data pages. Let 
H be of the form αβ and let T be an sm transaction of H. A page p has
an uncommitted structure modification by T in α if one of the following three
statements holds: (1) T is forward-rolling in α and the last sm on p in α is by T;
(2) T is backward-rolling in α and the last sm on p in α is by the forward-rolling
phase of T; (3) T is backward-rolling in α and the last sm on p in α is the inverse
of an sm o on p by T where o (in the forward-rolling phase of T) is not the first
sm on p by T. Cf. the definition of an uncommitted update in Section 3.
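
The three cases can be turned into a small predicate. The event encoding is hypothetical: the prefix is a list of sm events, each a dictionary with an id, the issuing transaction, the set of pages it touches, and, for an inverse, the id of the forward sm it undoes. The state of the transaction in the prefix is passed in by the caller.

```python
def has_uncommitted_sm(prefix, txn, page, state):
    """Does `page` have an uncommitted structure modification by `txn` in `prefix`?"""
    on_page = [e for e in prefix if page in e["pages"]]
    if not on_page or on_page[-1]["txn"] != txn:
        return False                      # the last sm on the page is not by txn
    last = on_page[-1]
    if state == "forward":
        return True                       # case (1)
    if state == "backward":
        if last["undoes"] is None:
            return True                   # case (2): last sm is from the forward phase
        first = next(e for e in on_page
                     if e["txn"] == txn and e["undoes"] is None)
        return last["undoes"] != first["id"]   # case (3): not yet back at the first sm
    return False                          # committed or rolled-back

forward = [
    {"id": 1, "txn": "S", "pages": {5}, "undoes": None},
    {"id": 2, "txn": "S", "pages": {5}, "undoes": None},
]
partly_undone = forward + [{"id": 3, "txn": "S", "pages": {5}, "undoes": 2}]
fully_undone = partly_undone + [{"id": 4, "txn": "S", "pages": {5}, "undoes": 1}]
```

Once the inverse of the first sm on the page has been run, the page no longer counts as modified by the transaction, even though the transaction itself is still rolling back.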

Now consider the case in which Ti and Tj are both sm transactions of H =
α oi β, where oi is an sm on page p by Ti. We define that oi is a dirty write on p by
Ti if p has an uncommitted sm by Tj in α. Thus, for example, an insert operation
oi = ni[p, x, q] on an index page p is a dirty write in H if Tj is forward-rolling
in α and the last sm in α on p is dj[p, x', q'], even if x' ≠ x. Updates on page
p by Ti that consume space released by Tj must be prevented, so that p would 
not overflow when Ti commits but Tj is rolled back [12,13,15]. Note that an 
sm transaction can only do physiological undo. Also updates by Ti that release 
space consumed by Tj must be prevented, so that the page would not underflow 
unexpectedly when Ti commits but Tj is rolled back. 

Then consider the case in which Tj is a search-tree transaction and Ti is 
an sm transaction. Tj can retrieve index and data terms and insert and delete 
data terms, while Ti can insert and delete index terms, allocate and deallocate 
empty index and data pages, and move index and data terms. No dirty writes are 
defined in this case. In particular note that it is not a dirty write if Ti moves a 
data term inserted by an active Tj. Nor is it a dirty write if Ti deallocates a data 
page p from which an active Tj has deleted the last data term. No dirty reads are
defined, because sm transactions do no reads. Unrepeatable reads appear in the 
case that Tj retrieves a key range of index terms while traversing down the tree, 
and then Ti inserts or deletes terms in that range. However, in order that such an 
anomaly could violate database integrity, Tj should use the retrieved information 
for updating. Since Tj can only update data terms, and it is reasonable to assume 
that those updates in no way depend on the particular search path followed, we 
do not define any anomaly in this case either. 

Finally consider the case in which Tj is an sm transaction and Ti is a
search-tree transaction of H = α oi β, where oi is an operation by Ti on a page p that
has an uncommitted sm by Tj in α. We define that oi is a dirty write on p by Ti
if oi is a data-term update on p. The justification for this anomaly is as for the
dirty writes between two sm transactions. We define that oi is a dirty read on p
by Ti if oi is an index-term or data-term retrieval on p. No unrepeatable read
can appear since Tj does no reads. 

Theorem 3. Let H be a history of search-tree transactions with nested
sm transactions such that every sm transaction is either committed or rolled-
back, all the committed sm transactions are correct, and H contains no dirty
writes defined above for pairs of transactions of which at least one is an sm
transaction. Assume that H can be run on a structurally consistent search tree
B and produces a search tree B'. Then B' is structurally consistent, and db(B')
is the database produced by running π(H) on db(B). □

Latching [5,14] is used to guarantee the physical consistency of pages under 
updates. The process that generates a search-tree transaction and its nested 
sm transactions must acquire a shared latch (S-latch) on any page retrieved for 
reading and an exclusive latch (X-latch) on any page retrieved for updating. 
The action of acquiring a latch is sometimes combined with the buffer manager 
operation that is used to fix a page into the buffer. The unlatching of a page 
is similarly combined with the unfixing of the page. Latch-coupling [2,5,12] is
a common way to guarantee the validity of traversed search-tree paths. The
latching protocol used is assumed to be deadlock-free. This is guaranteed if an
S-latch is never upgraded to an X-latch and if a parent or elder sibling page is
never latched when holding a latch on a child or a younger sibling. We assume
that any logical or physiological implementation on a search tree B of a database
operation on db(B) can be run on B under the latching protocol, starting with
an empty set of latched pages and ending with an empty set of latched pages
(see e.g. [9,11,12]). Full isolation for sm transactions is obtained if each page p
accessed by an sm transaction S is kept X-latched until S commits or, if S is
aborted, until all updates by S on p have been undone.
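
Latch-coupling during a root-to-leaf traversal can be sketched as follows. The sketch makes several simplifying assumptions: every latch is a plain threading.Lock (so S-latches and X-latches are not distinguished), the Page class and find_leaf function are hypothetical, and the children of an index page are kept sorted by low key with the first low key equal to minus infinity.

```python
import threading

class Page:
    def __init__(self, children=None, terms=None):
        self.latch = threading.Lock()
        self.children = children or []   # sorted list of (low_key, Page)
        self.terms = terms or {}

def find_leaf(root, key):
    """Descend to the leaf covering `key`; the leaf is returned still latched."""
    page = root
    page.latch.acquire()
    while page.children:
        child = [c for low, c in page.children if low <= key][-1]
        child.latch.acquire()    # latch the child first ...
        page.latch.release()     # ... and only then release the parent
        page = child
    return page                  # the caller must release page.latch

left = Page(terms={10: "a", 20: "b"})
right = Page(terms={30: "c", 40: "d"})
root = Page(children=[(float("-inf"), left), (30, right)])

leaf = find_leaf(root, 35)
held_on_return = leaf.latch.locked()
leaf.latch.release()
```

Latches are always requested in parent-to-child order, and at most two are held at a time, which is in line with the deadlock-freedom condition stated above.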

Lemma 4. Let H be a history of transactions on database D. Construct
from H a history H' by replacing some retrievals r[x, θz, v] by r[x, ≥z', v] where
x ≥ z' θ z. Then H can be run on D under the key-range locking protocol if and
only if H' can be run on D under the key-range locking protocol. □

Let H be a history of database transactions that can be run on a database D
and let B be a structurally consistent search tree with db(B) = D. A history H̃
of search-tree transactions with nested sm transactions is an implementation of
H on B if H̃ can be run on B, every sm transaction in H̃ is either committed and
correct or rolled-back, and π(H̃) = H', where H' can be constructed from H as
in Lemma 4. Here π(H̃) is a history of transactions π(T), where T is a search-tree
transaction in H̃.

Theorem 5. Let H be a history of database transactions that can be run on
a database D under the key-range locking protocol. Then for any structurally
consistent search tree B with db(B) = D, there exists an implementation H̃ of
H such that all the sm transactions in H̃ are committed and correct, and H̃ can
be run on B under the latching and locking protocols. □

Theorem 6. Let H be a history of search-tree transactions with nested
sm transactions such that all the committed sm transactions are correct and
H contains no isolation anomalies and can be run on a structurally consistent
search tree B. Then π(H) is conflict-equivalent to a serial history H₁ of the
transactions in π(H) and there exists an implementation H̃₁ of H₁ on B that
is a serial history of some search-tree transactions with nested committed and
correct sm transactions. □

The result of Theorem 6 can be characterized by saying that H is “data-
equivalent” [16] to a serial search-tree history H̃₁. Note however that the
transactions in H̃₁ are usually not the same as those in H. Nor is the search tree
produced by running H̃₁ on B necessarily the same as that produced by H on B,
although it is true that they represent the same database.

6 Recoverability 

Following ARIES [14], we assume that each data-term update o[p, x, v], o ∈
{n, d}, performed by a search-tree transaction T in its forward-rolling phase is
logged by writing the physiological log record (T, o, p, x, v, n), where n is the log
sequence number (LSN) of the log record for the previous data-term update of
T, and that the inverse o⁻¹[p', x, v] of such an update performed by an aborted
search-tree transaction T in its backward-rolling phase is logged by writing the
compensation log record (T, o⁻¹, p', x, v, n). Similarly, we assume that any sm
(i.e., page allocation, page deallocation, index-term update or term move) per- 
formed by an sm transaction S in its forward-rolling phase is logged by writing 
a physiological log record that contains the transaction identifier S, the name 
and arguments of the sm operation, and the previous LSN n of S, and that 
the (physiological) inverse of such an sm is logged by writing the corresponding 
compensation log record. The LSN of the log record is stamped in the PageLSN 
field of every page involved in the update. The write-ahead logging protocol [5, 
14] is applied, so that a data or index page with PageLSN n may be flushed onto 
disk only after flushing first all log records with LSNs less than or equal to n. 
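
The logging discipline can be illustrated with a toy log manager. Everything here is a simplified sketch under our own assumptions: the Log class, its fields and the flush_page function are hypothetical, LSNs are consecutive integers starting at 1, a previous LSN of 0 means "no earlier record", and flushing the log is modelled by advancing a counter.

```python
class Log:
    def __init__(self):
        self.records = []      # (lsn, txn, op, page_id, x, v, prev_lsn)
        self.flushed_lsn = 0   # all records with LSN <= flushed_lsn are on disk
        self.last_lsn = {}     # txn -> LSN of its previous log record

    def append(self, txn, op, page, x, v):
        """Write the record (T, o, p, x, v, n) and stamp the page with its LSN."""
        lsn = len(self.records) + 1
        prev = self.last_lsn.get(txn, 0)
        self.records.append((lsn, txn, op, page["id"], x, v, prev))
        self.last_lsn[txn] = lsn
        page["page_lsn"] = lsn
        return lsn

    def flush(self, up_to):
        self.flushed_lsn = max(self.flushed_lsn, up_to)

def flush_page(log, page, disk):
    """Write-ahead logging: flush the log up to PageLSN before the page itself."""
    if log.flushed_lsn < page["page_lsn"]:
        log.flush(page["page_lsn"])
    disk[page["id"]] = dict(page)

log, disk = Log(), {}
page = {"id": 7, "page_lsn": 0}
first = log.append("T1", "n", page, 42, "v")
second = log.append("T1", "d", page, 42, "v")
flush_page(log, page, disk)
```

The prev_lsn field chains the records of one transaction backwards, which is what an undo pass follows when it rolls the transaction back.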

Let H be an incomplete history representing the state of transaction pro- 
cessing after a system crash when the redo pass of restart recovery [14] has 
been completed and the undo pass is about to begin. In the standard ARIES 
algorithm, the undo pass is performed by a single backward sweep of the log,
rolling back all the active transactions. As noted in [9,12,13,15], this may not be
possible when active sm transactions are present, because the logical inverse of 
a data-term update logged after an sm performed by an active sm transaction 
may see a structurally inconsistent tree. The following theorem states that it is 
possible to recover by rolling back all the active sm transactions first (cf. [9]). 

Theorem 7. Let H be a history of search-tree transactions with nested
sm transactions such that all the committed sm transactions are correct and H
can be run on a structurally consistent search tree B under the latching protocol
(with full isolation for sm transactions) and under the key-range locking protocol.
Further let γ be any string in the shuffle of the completion strings for the active
sm transactions in H. We have:

(1) Hγ can be run on B under the latching and locking protocols and produces
a structurally consistent search tree B'.

Now let ψ be any string in the shuffle of the completion strings for the active
database transactions in π(Hγ). Then there exists a search-tree operation string
ψ̃ such that the following statements hold.

(2) π(ψ̃) = ψ and γψ̃ is a completion string for H.

(3) All the sm transactions in ψ̃ are committed and correct.

(4) ψ̃ can be run on B' under the latching and locking protocols and produces
a structurally consistent search tree B''. □

Theorem 7 covers the general case in which sm transactions may be of ar- 
bitrary length, as is the case with the conventional B-tree in which a series of 
page splits or merges propagating from a leaf page up to the root page must all 
be enclosed within a single sm transaction. In the event of a crash, such an sm 
transaction may have some of its effects reflected on disk and on the log while 
some effects are not, so that the sm transaction must first be rolled back before 
the (logical) undoing of the database updates can begin. If a new crash occurs 
while the undo pass is still in progress there may be new sm transactions active, 
caused by the logical inverses, which then must be rolled back before completing 
the rollback of the parent search-tree transaction. It is easy to see that there is 
no upper limit on the number of sm’s that repeated crashes can cause. 

In the case of a B-link-tree, on the other hand, sm transactions can be made 
short: an sm transaction may consist of a single index-term insert, a single index- 
term delete, a single page split, or a single page merge (cf. [9,10,11]). Each of these 
operations retains the structural consistency of a B-link-tree, when our balance 
conditions allow a page to have several sibling pages that are not directly child- 
linked to their parent page but are only accessible via the side links. Moreover, 
each of the four operations only involves a small fixed number of pages that need 
to be fixed and latched in the buffer during the operation, and the operation can 
be logged by a set of log records that fit into a single page. For example, an sm 
transaction for the split of a page p includes a page allocation n[p', t], a number 
of term moves m[p, x, v, p'] that are used to move about a half of the terms in
page p into page p', and an index-term insert n[p, y, p'] that inserts into p the
high-key term (y, p') where y is the least key in p'.
The log records describing the split, together with the begin and commit log
records of the sm transaction, can be packed into a single log record that occupies 
only about a half of a page. That log record is written and the pages p and p' 
are unlatched and unfixed only after the completion of the split operation, so 
that in the event of a crash, either the log on disk contains no trace of the split 
(in which case there is no trace of it in the database pages on disk either), or the 
entire log record is found on disk. In the latter case the log record may have to be 
replayed in the redo pass of recovery. In neither case is there anything to be done
for rolling back sm transactions. Then in Theorem 7 we may restrict ourselves
to the case in which γ = ε and all the sm transactions in H are committed. This
means that the undo pass of ARIES can readily be started to run the (logical) 
inverses of the data-term updates of active search-tree transactions. 

7 Conclusion 

Industrial-strength database systems use index-management algorithms that are 
based on a sophisticated interplay of the logical record-level operations and the 
physical page-level operations so that maximal concurrency is achieved while 
not sacrificing recoverability [12,13,15]. A key solution is to regard a series of 
structure modifications (sm’s) on the physical database as an open nested tran- 
saction that can commit even if the parent transaction aborts. A light-weight 
implementation of such sm transactions can be provided by ARIES nested top 
actions [14]. 

We have presented a transaction model that serves as a theoretical frame- 
work for analysing index-management algorithms in the context of B-trees and 
B-link-trees. Transactions on the logical database can contain an arbitrary num- 
ber of record inserts, record deletes, and key-range scans. Transactions on the 
physical database are modelled as two-level search-tree transactions that on the 
higher level do index-term retrievals and data-term inserts, deletes and retrievals 
and that on the lower level consist of sm transactions as open nested subtran- 
sactions. The semantics of the search tree is used to derive isolation conditions 
that guarantee the structural consistency of the search tree, isolation of logical 
database operations, and recoverability of both the search-tree structure and 
the database operations. The isolation conditions can be enforced by standard 
key-range locking and page-latching protocols. 

General correctness and recoverability results are derived for search-tree 
transaction histories that contain any number of forward-rolling, committed, 
backward-rolling, and rolled-back transactions. In the case of specific B-tree 
structures and algorithms the general results serve as a basis for establishing 
more stringent results on the efficiency of transaction processing in the presence 
of failures. In this extended abstract, proofs of the results are omitted. 
Abstract. The problem of answering queries using views has been studied
extensively due to its relevance in a wide variety of data-management
applications. In these applications, we often need to select a subset
of views to maintain due to limited resources. In this paper, we show that 
traditional query containment is not a good basis for deciding whether 
or not a view should be selected. Instead, we should minimize the view 
set without losing its query-answering power. To formalize this notion, 
we first introduce the concept of “p-containment.” That is, a view set 
V is p-contained in another view set W, if W can answer all the queries
that can be answered by V. We show that p-containment and the
traditional query containment are not related. We then discuss how to 
minimize a view set while retaining its query-answering power. We deve- 
lop the idea further by considering p-containment of two view sets with 
respect to a given set of queries, and consider their relationship in terms 
of maximally-contained rewritings of queries using the views. 



1 Introduction 

The problem of answering queries using views [2,3,9,13,17,22] has been studied 
extensively, because of its relevance to a wide variety of data management pro- 
blems, such as information integration, data warehousing, and query optimiza- 
tion. The problem can be stated as follows: given a query on a database schema 
and a set of views over the same schema, can we answer the query using only 
the answers to the views? Recently, Levy compiled a good survey [16] about the 
different approaches to this problem. 

In the context of query optimization, computing a query using previously 
materialized views can speed up query processing, because part of the compu- 
tation necessary for the query may have been done while computing the views. 
In a data warehouse, views can preclude costly access to the base relations and 
help answer queries quickly. In web-site designs, precomputed views can be used
to improve the performance of web-sites [11]. Before choosing an optimal design,
we must ensure that the chosen views can be used to answer the expected queries
at the web-site. A system that caches answers locally at the client can avoid
accesses to base relations at the server. The cached result of a query can be thought
of as a materialized view, with the query as its view definition. The client could
use the cached answers from previous queries to answer future queries. 



J. Van den Bussche and V. Vianu (Eds.): ICDT 2001, LNCS 1973, pp. 99—113, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 
However, the benefits presented by views are not without costs. Materialized
views often compete for limited resources. Thus, it is critical to select views
carefully. For instance, in an information-integration system [24], a view may
represent a set of web pages at an autonomous source. The mediator [26] in
these systems often needs to crawl these web pages periodically to refresh the 
cached data in its local repository [8]. In such a scenario, the cost manifests 
itself as the bandwidth needed for such crawls and the efforts in maintaining the 
cache up-to-date. Correspondingly, in a query-optimization and database-design 
scenario, the materialized views may have part of the computation necessary 
for the query. When a user poses a query, we need to decide how to answer 
the query using the materialized views. By selecting an optimal subset of views 
to materialize, we can reduce the computation needed to decide how to answer 
typical queries. In a client-server architecture with client-side caching, storing 
all answers to past queries may need a large storage space and will add to the 
maintenance costs. Since the client needs to deal with an evolving set of queries, 
any of these can be used to answer future queries. Thus, redundant views need 
not be cached. 

The following example shows that views can have redundancy to make such 
a minimization possible, and that traditional query containment is not a good 
basis for deciding whether a view should be selected or not. Instead, we should 
consider the query-answering power of the views.

Example 1. Suppose we have a client-server system with client-side caching for
improving performance, since server data accesses are expensive. The server has
the following base relation about books:

book(Title, Author, Pub, Price)

For example, the tuple (databases, smith, prenhall, $60) in the relation means
that a book titled databases has an author smith, is published by Prentice
Hall (prenhall), and has a current price of $60. Assume that the client has seen
the following three queries, the answers of which have been cached locally. The
cached data (or views), denoted by the view set V = {V1, V2, V3}, are:

V1: v1(T, A, P) :- book(T, A, B, P)

V2: v2(T, A, P) :- book(T, A, prenhall, P)

V3: v3(A1, A2) :- book(T, A1, prenhall, P1), book(T, A2, prenhall, P2)

The view V1 has title-author-price information about all books in the
relation, while the view V2 includes this information only about books published by
Prentice Hall. The view V3 has coauthor pairs for books published by Prentice
Hall. Since the view set has redundancy, we might want to eliminate a view to
save costs of its maintenance and storage. At the same time, we want to be assured
that such an elimination does not cause increased server accesses in response
to future queries.

Clearly, view V2 is contained in V1, i.e., V1 includes all the tuples in V2, so
we might be tempted to select {V1, V3}, and eliminate V2 as a redundant view.
However, with this selection, we cannot answer the query:

Q1: q1(T, P) :- book(T, smith, prenhall, P)

which asks for titles and prices of books written by smith and published by
prenhall. The reason is that even though V1 includes title-author-price
information about all books in the base relation, the publisher attribute is projected
out in the view’s head. Thus, using V1 only, we cannot tell which books are
published by prenhall. On the other hand, the query Q1 can be answered trivially
using V2:

P1: q1(T, P) :- v2(T, smith, P)

In other words, by dropping V2 we have lost some power to answer queries. In
addition, note that even though view V3 is not contained in either V1 or V2, it can
be eliminated from V without changing the query-answering power of V. The
reason is that V3 can be computed from V2 as follows:

v3(A1, A2) :- v2(T, A1, P1), v2(T, A2, P2)

To summarize, we should not select {V1, V3} but {V1, V2}, even though the former
includes all the tuples in V, while the latter does not. The rationale is that
the latter is as “powerful” as V while the former is not. Caution: One might
hypothesize from this example that only projections in view definitions cause
such a mismatch, since we do not lose any “data” in the body of the view. We
show in Section 3 that this hypothesis is wrong.
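
The example can be replayed on a concrete instance. The three book tuples below are invented for illustration, and Python set comprehensions stand in for the conjunctive views and rewritings; a single instance only illustrates the claims and does not prove them.

```python
# Hypothetical instance of book(Title, Author, Pub, Price).
book = {
    ("databases", "smith", "prenhall", 60),
    ("databases", "jones", "prenhall", 60),
    ("logic", "smith", "mit", 45),
}

# The three cached views of Example 1.
v1 = {(t, a, p) for (t, a, b, p) in book}
v2 = {(t, a, p) for (t, a, b, p) in book if b == "prenhall"}
v3 = {(a1, a2)
      for (t1, a1, b1, _) in book for (t2, a2, b2, _) in book
      if t1 == t2 and b1 == b2 == "prenhall"}

# Q1 evaluated on the base relation, and two attempts to answer it from views.
q1 = {(t, p) for (t, a, b, p) in book if a == "smith" and b == "prenhall"}
q1_from_v2 = {(t, p) for (t, a, p) in v2 if a == "smith"}
q1_from_v1 = {(t, p) for (t, a, p) in v1 if a == "smith"}  # publisher is gone

# V3 recomputed from V2 alone.
v3_from_v2 = {(a1, a2) for (t1, a1, _) in v2 for (t2, a2, _) in v2 if t1 == t2}
```

On this instance the attempt from V1 also returns the book published elsewhere, because V1 no longer carries the publisher, while the rewritings over V2 reproduce both Q1 and V3 exactly.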

In this paper we discuss how to minimize a view set without losing its query- 
answering power. We first introduce the concept of p-containment between two 
view sets, where “p” stands for query-answering power (Sections 2 and 3). A 
view set V is p-contained in another view set W, or W is at least as powerful as 
V, if W can answer all the queries that can be answered using V. Two view sets 
are called equipotent if they have the same power to answer queries. As shown 
in Example 1, two view sets may have the same tuples, yet have different query- 
answering power. That is, traditional view containment [6,23] does not imply 
p-containment. The example further shows that the reverse direction is also not 
implied. In Section 3.2 we show how, given a view set V on base relations, to
find a minimal subset of V that is equipotent to V. As one might suspect, a
view set can have multiple equipotent minimal subsets.

In some scenarios, users are restricted in the queries they can ask. In such 
cases, equipotence may be determined relative to the expected (possibly infinite) 
set of queries. In Section 4, we investigate the above questions of equipotence 
testing given this extra constraint. In particular, we consider infinite query sets 
defined by finite parameterized queries, and develop algorithms for testing this 
relative p-containment. 

In information-integration systems, we often need to consider not only
equivalent rewritings of a query using views, but also maximally-contained rewritings
(MCR’s). Analogous to p-containment, which requires equivalent rewritings, we
introduce the concept of MCR-containment that is defined using
maximally-contained rewritings (Section 5). Surprisingly, we show that p-containment
implies MCR-containment, and vice-versa.

The containments between two finite sets of conjunctive views discussed in 
this paper are summarized in Table 1. In the full version of the paper [19] we 
discuss how to generalize the results to other languages, such as conjunctive 
queries with arithmetic comparisons, unions of conjunctive queries, and datalog. 



Table 1. Containments between two finite sets of conjunctive views: V and W.

Containment             Definition                              How to test
----------------------  --------------------------------------  ---------------------------------------
v-containment           For any database, a tuple in a view     Check if each view in V is contained
V ⊑_v W                 in V is in a view in W.                 in some view in W.

p-containment           If a query is answerable by V, then     Check if each view in V is answerable
V ⪯_p W                 it is answerable by W.                  by W.

relative p-containment  For each query Q in a given set of      Test by the definition if Q is finite.
V ⪯_Q W                 queries Q, if Q is answerable by V,     See Section 4.2 for infinite query sets
                        then Q is answerable by W.              defined by parameterized queries.

MCR-containment         For each query Q, for any maximally-    Same as testing if V ⪯_p W, since
V ⪯_MCR W               contained rewriting MCR(Q, V) (resp.    V ⪯_p W ⇔ V ⪯_MCR W.
                        MCR(Q, W)) of Q using V (resp. W),
                        MCR(Q, V) ⊑ MCR(Q, W).



2 Background 

In this section, we review some concepts about answering queries using views
[17]. Let r1, ..., rm be m base relations in a database. We first consider queries
on the database in the following conjunctive form:

h(X) :- g1(X1), ..., gk(Xk)

In each subgoal gi(Xi), predicate gi is a base relation, and every argument in the
subgoal is either a variable or a constant. We consider views defined on the base
relations by safe conjunctive queries, i.e., every variable in a query’s head appears
in the body. Note that we take the closed-world assumption [1], since the views
are computed from existing database relations. We shall use names beginning
with lower-case letters for constants and relations, and names beginning with
upper-case letters for variables.

Definition 1. (query containment and equivalence) A query Q1 is contained in
a query Q2, denoted by Q1 ⊑ Q2, if for any database D of the base relations,
Q1(D) ⊆ Q2(D). The two queries are equivalent if Q1 ⊑ Q2 and Q2 ⊑ Q1.




Minimizing View Sets without Losing Query- Answering Power 103 



Definition 2. (expansion of a query using views) The expansion of a query P
on a set of views V, denoted by P^exp, is obtained from P by replacing all the views
in P with their corresponding base relations. Existentially quantified variables in
a view are replaced by fresh variables in P^exp.



Definition 3. (rewritings and equivalent rewritings) Given a query Q and a
view set V, a query P is a rewriting of query Q using V if P uses only the views
in V, and P^exp ⊑ Q. P is an equivalent rewriting of Q using V if P^exp and Q
are equivalent. We say a query Q is answerable by V if there exists an equivalent
rewriting of Q using V.

In Example 1, P1 is an equivalent rewriting of the query Q1 using view V2,
because the expansion of P1:

P1^exp: q1(T, P) :- book(T, smith, prenhall, P)

is equivalent to Q1. Thus, query Q1 is answerable by V2, but it is not answerable
by {V1, V3}.
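
The expansion step of Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: atoms are encoded as (predicate, arguments) pairs, the helper names `is_var` and `expand` are hypothetical, view heads are assumed to consist of distinct variables, fresh variables are named F1, F2, ... on the assumption that these names do not already occur in the rewriting, and the definition of V2 is inferred from the expansion of P1 shown above.

```python
from itertools import count

def is_var(term):
    # Section 2 convention: variables start with an upper-case letter.
    return term[:1].isupper()

def expand(rewriting_body, views):
    """Replace each view atom by the body of its view (Definition 2)."""
    fresh = count(1)
    expansion = []
    for name, args in rewriting_body:
        head, body = views[name]
        # Head variables take the arguments used in the rewriting ...
        sub = dict(zip(head, args))
        for pred, terms in body:
            for t in terms:
                if is_var(t) and t not in sub:
                    # ... and existential variables become fresh variables.
                    sub[t] = f"F{next(fresh)}"
            expansion.append((pred, tuple(sub.get(t, t) for t in terms)))
    return expansion

# V2 as inferred from the expansion of P1: v2(T, A, P) :- book(T, A, prenhall, P)
views = {"v2": (("T", "A", "P"), [("book", ("T", "A", "prenhall", "P"))])}

# P1: q1(T, P) :- v2(T, smith, P)
p1_body = [("v2", ("T", "smith", "P"))]
p1_expansion = expand(p1_body, views)
```

Here `p1_expansion` holds the single atom book(T, smith, prenhall, P), matching the expansion of P1 given in the text.
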
In this paper we consider finite view sets. Several algorithms have been
developed for answering queries using views, such as the bucket algorithm [18,12],
the inverse-rule algorithm [22,10], and the algorithms in [20,21]. See [1,17] for a
study of the complexity of answering queries using views. In particular, it has
been shown that the problem of rewriting a query using views is NP-complete.

3 Comparing Query-Answering Power of View Sets

In this section we first introduce the concept of p-containment, and compare it
with traditional query containment. Then we discuss how to minimize a view set
without losing its query-answering power with respect to all possible queries.

Definition 4. (p-containment and equipotence) A view set V is p-contained in
another view set W, or “W is at least as powerful as V,” denoted by V ⪯_p W,
if any query answerable by V is also answerable by W. Two view sets are
equipotent, denoted by V ≍_p W, if V ⪯_p W and W ⪯_p V.

In Example 1, the two view sets {V1, V2} and {V1, V2, V3} are equipotent,
since the latter can answer all the queries that can be answered by the former,
and vice-versa. (We will give a formal proof shortly.) However, the two view
sets {V1, V3} and {V1, V2, V3} are not equipotent, since the latter can answer
the query Q1, which cannot be answered by the former. The following lemma
suggests an algorithm for testing p-containment.

Lemma 1. Let V and W be two view sets. V ⪯_p W iff for every view V ∈ V, if
treated as a query, V is answerable by W.

^ Due to space limitations, we do not provide all the proofs of the lemmas and
theorems. Some proofs are given in the full version of the paper [19].

The importance of this lemma is that we can test V ⪯_p W simply by checking
if every view in V is answerable by W. That is, we can just consider a finite set of
queries, even though V ⪯_p W means that W can answer all of the infinitely many
queries that can be answered by V. We can use the algorithms in [10,12,18,22]
to do the checking. It is easy to see that the relationship “⪯_p” is reflexive
and transitive, and antisymmetric up to equipotence. Using the results of [17] for
the complexity of testing whether a query is answerable by a set of views, we have:

Theorem 1. The problem of whether a view set is p-contained in another view
set is NP-hard.



Example 2. As we saw in Example 1, view V3 is answerable by view V2. By
Lemma 1, we have {V1, V2, V3} ⪯_p {V1, V2}. Clearly the other direction is also
true, so {V1, V2} ≍_p {V1, V2, V3}. On the other hand, V2 cannot be answered
using {V1, V3}, which means {V1, V2, V3} ⋠_p {V1, V3}.



3.1 Comparing P-Containment and Traditional Query Containment 

We are interested in the relationship between p-containment and the traditional 
concept of query containment (as in Definition 1). Before making the compari- 
sons, we first generalize the latter to a concept called v-containment to cover the 
cases where the views in a set have different schemas. 

Definition 5. (v-containment and v-equivalence) A view set V is v-contained
in another view set W, denoted by V ⊑_v W, if the following holds. For any
database D of the base relations, if tuple t is in V(D) for a view V ∈ V, then
there is a view W ∈ W, such that t ∈ W(D). The two sets are v-equivalent if
V ⊑_v W and W ⊑_v V.

In Example 1, the two view sets {V1, V2, V3} and {V1, V3} are v-equivalent,
while their views have different schemas. The example shows that v-containment
does not imply p-containment, and vice-versa. One might guess that if we do
not allow projections in the view definitions (i.e., all the variables in the body
of a view appear in the head), then v-containment could imply p-containment.
However, the following example shows that this guess is incorrect.

Example 3. Let e(A1, A2) be a base relation, where a tuple e(x, y) means that
there is an edge from vertex x to vertex y in a graph. Consider two view sets:

V = {V1}, V1: v1(A, B, C) :- e(A, B), e(B, C), e(A, C)

W = {W1}, W1: w1(A, B, C) :- e(A, B), e(B, C)

As illustrated by Figure 1, view V1 stores all the subgraphs shown in Figure 1(a),
while view W1 stores all the subgraphs shown in Figure 1(b). The two views do
not have projections in their definitions, and V ⊑_v W. However, V ⋠_p W, since
V1 cannot be answered using W1.







[Figure: panels (a) View V1 and (b) View W1, showing Subgraph 1 and Subgraph 2 on vertices A, B, and C.]

Fig. 1. Diagram for the two views in Example 3



The following example shows that p-containment does not imply v-containment,
even if the views in the sets have the same schemas.

Example 4. Let r(X1, X2) and s(Y1, Y2) be two base relations on which two view
sets are defined:

V = {V1},     V1: v1(A, C) :- r(A, B), s(B, C)

W = {W1, W2}, W1: w1(A, B) :- r(A, B)

              W2: w2(B, C) :- s(B, C)

Clearly V ⋢_v W, but V ⪯_p W, since there is a rewriting of V1 using W:
v1(A, C) :- w1(A, B), w2(B, C).
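
The claims of Examples 3 and 4 can be checked mechanically. The sketch below is not one of the algorithms cited in the text; it assumes the standard containment-mapping test for conjunctive queries and a canonical-rewriting check (evaluate the views on the frozen query body, expand the resulting view atoms, and test containment in the query), which applies to conjunctive queries and views without comparisons. All function names and the tuple encoding of atoms are hypothetical, and the fresh names Fresh0, Fresh1, ... are assumed not to occur in the input.

```python
from itertools import count

def is_var(term):
    # Section 2 convention: variables start with an upper-case letter.
    return term[:1].isupper()

def matches(atoms, facts, sub=None):
    """Yield each substitution that maps every atom in `atoms` onto a fact."""
    sub = {} if sub is None else sub
    if not atoms:
        yield sub
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        new = dict(sub)
        if all((new.setdefault(a, f) if is_var(a) else a) == f
               for a, f in zip(args, fact_args)):
            yield from matches(rest, facts, new)

def contained(q1, q2):
    """q1 ⊑ q2: look for a containment mapping from q2 into q1."""
    (head1, body1), (head2, body2) = q1, q2
    frozen = [("#head", head1)] + list(body1)   # q1's variables act as constants
    pattern = [("#head", head2)] + list(body2)
    return next(matches(pattern, frozen), None) is not None

def answerable(query, views):
    """True if `query` has an equivalent rewriting using `views`."""
    q_head, q_body = query
    fresh = count()
    view_atoms, expansion = [], []
    for name, (v_head, v_body) in views.items():
        # Evaluate the view on the frozen body of the query.
        for sub in matches(list(v_body), list(q_body)):
            atom = (name, tuple(sub.get(t, t) for t in v_head))
            if atom in view_atoms:
                continue
            view_atoms.append(atom)
            # Expand the view atom; existential variables get fresh names.
            local = {t: sub[t] for t in v_head if is_var(t)}
            for pred, args in v_body:
                for t in args:
                    if is_var(t) and t not in local:
                        local[t] = f"Fresh{next(fresh)}"
                expansion.append((pred, tuple(local.get(t, t) for t in args)))
    covered = {t for _, args in view_atoms for t in args}
    if any(is_var(t) and t not in covered for t in q_head):
        return False  # the candidate rewriting would be unsafe
    return contained((q_head, expansion), query)

def p_contained(v_set, w_set):
    # Lemma 1: every view of V, read as a query, must be answerable by W.
    return all(answerable(view, w_set) for view in v_set.values())

def v_contained(v_set, w_set):
    # Table 1: every view of V must be contained in some view of W.
    return all(any(contained(v, w) for w in w_set.values())
               for v in v_set.values())

# Example 3
ex3_v = {"v1": (("A", "B", "C"),
                [("e", ("A", "B")), ("e", ("B", "C")), ("e", ("A", "C"))])}
ex3_w = {"w1": (("A", "B", "C"), [("e", ("A", "B")), ("e", ("B", "C"))])}
# Example 4
ex4_v = {"v1": (("A", "C"), [("r", ("A", "B")), ("s", ("B", "C"))])}
ex4_w = {"w1": (("A", "B"), [("r", ("A", "B"))]),
         "w2": (("B", "C"), [("s", ("B", "C"))])}
```

With these inputs, `v_contained(ex3_v, ex3_w)` holds while `p_contained(ex3_v, ex3_w)` does not, and the situation is reversed for Example 4.
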

3.2 Finding an Equipotent Minimal Subset 

In many applications, each view is associated with a cost, such as its storage 
space or the number of web pages that need to be crawled for the view [8]. We 
often need to find a subset of views that has the same query-answering power. 

Definition 6. (equipotent minimal subsets) A subset M of a view set V is an
equipotent minimal subset (EMS for short) of V if M ≍_p V, and for any V ∈ M:
M - {V} ≭_p V (i.e., M - {V} is not equipotent to V).



Informally, an equipotent minimal subset of a view set V is a minimal subset 
that is as powerful as V. For instance, in Example 1, the view set {Vi, V2} is an 
EMS of {Vi, V2, V3}. We can compute an EMS of a view set V using the following 
Shrinking algorithm. 

Algorithm Shrinking initially sets M = V. For each view V ∈ M, it
checks if V is answerable by the views M - {V}. If so, it removes V from
M. It repeats this process until no more views can be removed from M,
and returns the resulting M as an EMS of V.



Example 5. This example shows that, as suspected, a view set may have multiple
EMS’s. Suppose r(A, B) is a base relation, on which the following three views
are defined:

V1: v1(A) :- r(A, B)

V2: v2(B) :- r(A, B)

V3: v3(A, B) :- r(A, X), r(Y, B)

Let V = {V1, V2, V3}. Then V has two EMS’s: {V1, V2} and {V3}, as shown by
the following rewritings:

rewrite V1 using V3: v1(A) :- v3(A, B)

rewrite V2 using V3: v2(B) :- v3(A, B)

rewrite V3 using {V1, V2}: v3(A, B) :- v1(A), v2(B)
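
The Shrinking algorithm of Section 3.2 can be sketched as a loop around an answerability test. In this sketch the test is passed in as a callback; `shrinking` and `example5_oracle` are hypothetical names, and the oracle is a toy that merely hard-codes the three rewritings listed in Example 5 rather than a general algorithm for answering queries using views.

```python
def shrinking(views, answerable):
    """Return an equipotent minimal subset of `views` (a list of view names).

    `answerable(view, others)` must report whether `view` is answerable
    by the views named in `others`.
    """
    m = list(views)
    changed = True
    while changed:                      # repeat until nothing can be removed
        changed = False
        for v in list(m):               # iterate over a copy while shrinking m
            others = [u for u in m if u != v]
            if answerable(v, others):
                m.remove(v)
                changed = True
    return m

def example5_oracle(view, others):
    # Toy oracle for Example 5 only: V1 and V2 are answerable by V3,
    # and V3 is answerable by V1 and V2 together.
    if view in ("V1", "V2"):
        return "V3" in others
    return view == "V3" and {"V1", "V2"}.issubset(others)

ems_a = shrinking(["V1", "V2", "V3"], example5_oracle)
ems_b = shrinking(["V3", "V1", "V2"], example5_oracle)
```

The order in which views are visited decides which EMS is returned: the first call yields {V3} and the second yields {V1, V2}, the two EMS’s of Example 5.
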

We often want to find an EMS such that the total cost of selected views is
minimum. We believe that the problem of finding an optimal EMS efficiently
deserves further investigation.

4 Testing P-Containment Relative to a Query Set 

Till now, we have considered p-containment between two view sets with respect 
to a “universal” set of queries, i.e., users can ask any query on the base relations. 
However, in some scenarios, users are restricted in the queries they can ask. In 
this section, we consider the relationship between two view sets with respect to 
a given set of queries. In particular, we consider infinite query sets defined by 
finite parameterized queries. 

Definition 7. (relative p-containment) Given a (possibly infinite) set of queries
Q, a view set V is p-contained in a view set W w.r.t. Q, denoted by V ⪯_Q W,
iff for any query Q ∈ Q that is answerable by V, Q is also answerable by W.
The two view sets are equipotent w.r.t. Q, denoted by V ≍_Q W, if V ⪯_Q W and
W ⪯_Q V.

Example 6. Assume we have relations car(Make, Dealer) and loc(Dealer, City)
that store information about cars, their dealers, and the cities where the dealers
are located. Consider the following two queries and three views:

Queries: Q1: q1(D, C) :- car(toyota, D), loc(D, C)

         Q2: q2(D, C) :- car(honda, D), loc(D, C)

Views:   W1: w1(D, C) :- car(toyota, D), loc(D, C)

         W2: w2(D, C) :- car(honda, D), loc(D, C)

         W3: w3(M, D, C) :- car(M, D), loc(D, C)

Let Q = {Q1, Q2}, V = {W1, W2}, and W = {W3}. Then V and W are equipotent
w.r.t. Q, since Q1 and Q2 can be answered by V as well as W. Note that V
and W are not equipotent in general.

Given a view set V and a query set Q, we define an equipotent minimal subset
(EMS) of V w.r.t. Q in the same manner as Definition 6. We can compute an
EMS of V w.r.t. Q in the same way as in Section 3.2, if we have a method to
test relative p-containment. This testing is straightforward when Q is finite; i.e.,
by definition, we can check for each query Qi ∈ Q that is answerable by V,
whether Qi is also answerable by W. However, if Q is infinite, testing relative
p-containment becomes more challenging, since we cannot use this
enumerate-and-test paradigm for all the queries in Q. In the rest of this section we consider
ways to test relative p-containment w.r.t. infinite query sets defined by finite
parameterized queries.

4.1 Parameterized Queries

A parameterized query is a conjunctive query that contains placeholders in the
argument positions of its body, in addition to constants and variables. A
placeholder is denoted by an argument name beginning with a $ sign.

Example 7. Consider the following parameterized query Q on the two relations
in Example 6:

Q: q(D) :- car($M, D), loc(D, $C)

This query represents all the following queries: a user gives a car make m for the
placeholder $M, and a city c for the placeholder $C, and asks for the dealers of
the make m in the city c. For example, the following are two instances of Q:

I1: q(D) :- car(toyota, D), loc(D, sf)

I2: q(D) :- car(honda, D), loc(D, sf)

which respectively ask for dealers of Toyota and Honda in San Francisco (sf).

In general, each instance of a parameterized query Q is obtained by assigning
a constant from the corresponding domain to each placeholder. If a placeholder
appears in different argument positions, then the same constant must be used
in these positions. Let IS(Q) denote the set of all instances of the query Q, and
likewise for a query set Q. We assume that the domains of placeholders
are infinite (independent of an instance of the base relations), causing IS(Q) to
be infinite. Thus we can represent an infinite set of queries using a finite set of
parameterized queries.
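
Instantiating a parameterized query amounts to a substitution on its placeholders. The sketch below is illustrative only: the name `instantiate` and the tuple encoding of queries are assumptions. Because the assignment maps each placeholder name to one constant, a placeholder that appears in several argument positions automatically receives the same constant in all of them.

```python
def instantiate(query, assignment):
    """Replace every placeholder ($-prefixed term) by its assigned constant."""
    head, body = query
    placeholders = {t for _, args in body for t in args if t.startswith("$")}
    missing = placeholders - set(assignment)
    if missing:
        raise ValueError(f"no constant given for {sorted(missing)}")
    return head, [(pred, tuple(assignment.get(t, t) for t in args))
                  for pred, args in body]

# Parameterized query Q of Example 7: q(D) :- car($M, D), loc(D, $C)
Q = (("D",), [("car", ("$M", "D")), ("loc", ("D", "$C"))])

I1 = instantiate(Q, {"$M": "toyota", "$C": "sf"})
I2 = instantiate(Q, {"$M": "honda", "$C": "sf"})
```

`I1` and `I2` correspond to the two instances listed in Example 7.
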

Example 8. Consider the following three views:

V1: v1(M, D, C) :- car(M, D), loc(D, C)

V2: v2(M, D) :- car(M, D), loc(D, sf)

V3: v3(M) :- car(M, D), loc(D, sf)

Clearly, view V1 can answer all instances of Q in Example 7, since it includes
information for cars and dealers in all cities. View V2 cannot answer all instances,
since it has only the information about dealers in San Francisco. But it can
answer instances of the following more restrictive parameterized query, which
replaces the placeholder $C by sf:

Q': q(D) :- car($M, D), loc(D, sf)

That is, the user can only ask for information about dealers in San Francisco.
Finally, view V3 cannot answer any instance of Q, since it does not have the
Dealer attribute in its head.

Given a finite set of parameterized queries Q and two view sets V and W,
the example above suggests the following strategy of testing V ⪯_IS(Q) W:

1. Deduce all instances of Q that can be answered by V.

2. Test if W can answer all such instances.







C. Li, M. Bawa, and J.D. Ullman 



In the next two subsections we show how to perform each of these steps. 
We show that all answerable instances of a parameterized query for a given 
view set can be represented by a finite set of parameterized queries. We give an 
algorithm for deducing this set, and an algorithm for the second step. Although 
our discussion is based on one parameterized query, the results can be easily 
generalized to a finite set of parameterized queries. 



4.2 Complete Answerability of a Parameterized Query 

We first consider the problem of testing whether all instances of a parameterized 
query Q can be answered by a view set V. If so, we say that Q is completely 
answerable by V. 

Definition 8. (canonical instance) A canonical instance of a parameterized
query Q (given a view set V) is an instance of Q, in which each placeholder
is replaced by a new distinct constant that does not appear in Q and V.



Lemma 2. A parameterized query Q is completely answerable by a view set V
if and only if V can answer a canonical instance of Q (given V).

The lemma suggests an algorithm TestComp for testing whether all instances
of a parameterized query Q can be answered by a view set V.

Algorithm TestComp first constructs a canonical instance Qc of Q (given
V). Then it tests if Qc can be answered using V by calling an algorithm for
answering queries using views, such as those in [10,12,18,22]. It outputs
“yes” if V can answer Qc; otherwise, it outputs “no.”
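
The first step of TestComp, constructing a canonical instance, can be sketched as follows. The function name `canonical_instance`, the tuple encoding of queries, and the naming scheme for fresh constants (the lower-cased placeholder name followed by the first unused index) are assumptions made for illustration; the constants that appear in Q and V are passed in explicitly as `used_constants`.

```python
def canonical_instance(query, used_constants):
    """Replace each placeholder by a new, distinct constant (Definition 8)."""
    head, body = query
    taken = set(used_constants)         # constants appearing in Q and V
    fresh = {}
    for _, args in body:
        for t in args:
            if t.startswith("$") and t not in fresh:
                i = 0
                while f"{t[1:].lower()}{i}" in taken:
                    i += 1
                fresh[t] = f"{t[1:].lower()}{i}"
                taken.add(fresh[t])     # keeps the new constants distinct
    return head, [(pred, tuple(fresh.get(t, t) for t in args))
                  for pred, args in body]

# Q of Example 7; the only constant in Q and the views of Example 8 is sf.
Q = (("D",), [("car", ("$M", "D")), ("loc", ("D", "$C"))])
Qc = canonical_instance(Q, {"sf"})
```

For this input, `Qc` encodes q(D) :- car(m0, D), loc(D, c0). The second step of TestComp would pass `Qc` to whichever algorithm for answering queries using views is available.
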



Example 9. Consider the parameterized query Q of Example 7 and the views of
Example 8. To test whether view V1 can answer all instances of Q, we use two
new distinct constants m0 and c0 to replace the two placeholders $M and $C,
and obtain the following canonical instance:

Qc: q(D) :- car(m0, D), loc(D, c0)

Clearly Qc can be answered by view V1, because of the following equivalent
rewriting of Qc:

Pc: q(D) :- v1(m0, D, c0)

By Lemma 2, view V1 can answer all instances of Q. In addition, since V2 cannot
answer Qc (which is also a canonical instance of Q given V2), it cannot answer
some instances of Q. The same argument holds for V3.

4.3 Partial Answerability of a Parameterized Query

As shown by view V2 and query Q in Example 8, even if a view set cannot
answer all instances of a parameterized query, it can still answer some instances.
In general, we want to know what instances can be answered by the view set, and
whether these instances can also be represented as a set of more “restrictive”
parameterized queries. A parameterized query Q1 is more restrictive than a
parameterized query Q if every instance of Q1 is also an instance of Q. For
example, query q(D) :- car($M, D), loc(D, sf) is more restrictive than query
q(D) :- car($M, D), loc(D, $C), since the former requires the second argument
of the loc subgoal to be sf, while the latter allows any constant for placeholder
$C. For another example, query q(M, C) :- car(M, $D1), loc($D1, C) is more
restrictive than query q(M, C) :- car(M, $D1), loc($D2, C), since the former has
one placeholder in two argument positions, while the latter allows two different
constants to be assigned to its two placeholders.

All the parameterized queries that are more restrictive than Q can be
generated by adding the following two types of restrictions:

1. Type I: Some placeholders must be assigned the same constant. Formally, let
{$A1, ..., $Ak} be some placeholders in Q. We can put a restriction
$A1 = ... = $Ak on the query Q. That is, we can replace all these k placeholders
with any of them.

2. Type II: For a placeholder $Ai in Q and a constant c in Q or V, we put a
restriction $Ai = c on Q. That is, the user can only assign constant c to this
placeholder in an instance.

Consider all the possible (finite) combinations of these two types of
restrictions. For example, suppose Q has two placeholders, {$A1, $A2}, and Q and V
have one constant c. Then we consider the following restriction combinations:
{}, {$A1 = $A2}, {$A1 = c}, {$A2 = c}, and {$A1 = $A2 = c}. Note that
we allow a combination to have restrictions of only one type. In addition, each
restriction combination is consistent, in the sense that it does not have a
restriction $A1 = $A2 and two restrictions $A1 = c1 and $A2 = c2, while c1 and c2 are
two different constants in Q and V. For each restriction combination RCi, let
Q(RCi) be the parameterized query that is derived by adding the restrictions
in RCi to Q. Clearly Q(RCi) is a parameterized query that is more restrictive
than Q. Let Φ(Q, V) denote all these more restrictive parameterized queries.
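
The consistent restriction combinations can be enumerated by partitioning the placeholders into blocks (Type I) and optionally binding each block to a constant (Type II). The sketch below is one possible encoding, not the authors' procedure: the function names are hypothetical, a combination is represented as a tuple of (block, constant-or-None) pairs, and two blocks are never bound to the same constant because that case coincides with merging the two blocks.

```python
from itertools import product

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition

def restriction_combinations(placeholders, constants):
    """Enumerate the consistent combinations of Type I and Type II restrictions."""
    combos = []
    for partition in set_partitions(list(placeholders)):
        # None means that the block stays a placeholder.
        for choice in product([None] + list(constants), repeat=len(partition)):
            bound = [c for c in choice if c is not None]
            if len(bound) != len(set(bound)):
                continue  # same constant on two blocks: covered by a coarser partition
            pairs = [(tuple(sorted(block)), c)
                     for block, c in zip(partition, choice)]
            combos.append(tuple(sorted(pairs, key=lambda pair: pair[0])))
    return combos

combos = restriction_combinations(["$A1", "$A2"], ["c"])
```

For two placeholders and one constant, `combos` has five entries, corresponding to {}, {$A1 = $A2}, {$A1 = c}, {$A2 = c}, and {$A1 = $A2 = c}.
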

Suppose I is an instance of Q that can be answered by V. We can show
that there exists a parameterized query Qi ∈ Φ(Q, V), such that I is a canonical
instance of Qi. By Lemma 2, Qi is completely answerable by V. Therefore, we
have proved the following theorem:

Theorem 2. All instances of a parameterized query Q that are answerable by
a view set V can be generated by a finite set of parameterized queries that are
more restrictive than Q, such that all these parameterized queries are completely
answerable by V.

We propose the following algorithm GenPartial. Given a parameterized query
Q and a view set V, the algorithm generates all the parameterized queries that
are more restrictive than Q, such that they are completely answerable by V, and
they define all the instances of Q that are answerable by V.

Algorithm GenPartial first generates all the restriction combinations, and
creates a parameterized query for each combination. Then it calls the
algorithm TestComp to check if this parameterized query is completely
answerable by V. It outputs all the parameterized queries that are
completely answerable by V.



4.4 Testing P-Containment Relative to Finite Parameterized 
Queries 

Now we give an algorithm for testing p-containment relative to parameterized
queries. Let Q be a query set with only one parameterized query Q. Let V and W
be two view sets. The algorithm tests V ⪯_IS(Q) W as follows. First call the
algorithm GenPartial to find all the more restrictive parameterized queries of Q that
are completely answerable by V. For each of them, call the algorithm TestComp
to check if it is also completely answerable by W. By Theorem 2, V ⪯_IS(Q) W
iff all these parameterized queries that are completely answerable by V are also
completely answerable by W. The algorithm can be easily generalized to the
case where Q is a finite set of parameterized queries.



5 MCR-Containment 

So far we have considered query-answering power of views with respect to
equivalent rewritings of queries. In information-integration systems, we often need
to consider maximally-contained rewritings. In this section, we introduce the
concept of MCR-containment, which describes the relative power of two view
sets in terms of their maximally-contained rewritings of queries. Surprisingly,
MCR-containment is essentially the same as p-containment.

Definition 9. (maximally-contained rewritings) A maximally-contained rewriting
P of a query Q using a view set V satisfies the following conditions: (1) P
is a finite union of conjunctive queries using only the views in V; (2) For any
database, the answer computed by P is a subset of the answer to Q; and (3)
No other union of conjunctive queries that satisfies the two conditions above can
properly contain P.

Intuitively, a maximally-contained rewriting (henceforth “MCR” for short)
is a plan that uses only views in V and computes the maximal answer to query
Q. If Q has two MCR’s, by definition, they must be equivalent as queries. If Q
is answerable by V, then an equivalent rewriting of Q using V is also an MCR
of Q.

Example 10. Consider the following query Q and view V on the two relations in
Example 6:

Q: q(M, D, C) :- car(M, D), loc(D, C)

V: v(M, D, sf) :- car(M, D), loc(D, sf)

Suppose we have access to view V only. Then q(M, D, sf) :- v(M, D, sf) is
an MCR of the query Q using the view V. That is, we can give the user only
the information about car dealers in San Francisco as an answer to the query,
but not anything more.



Definition 10. (MCR-containment) A view set V is MCR-contained in another
view set W, denoted by V ⪯_MCR W, if for any query Q, we have MCR(Q, V) ⊑
MCR(Q, W), where MCR(Q, V) and MCR(Q, W) are MCR’s of Q using V and
W, respectively.^ The two sets are MCR-equipotent, denoted by V ≍_MCR W, if
V ⪯_MCR W and W ⪯_MCR V.

Surprisingly, MCR-containment is essentially the same as p-containment. 

Theorem 3. For two view sets V and W, V ⪯_p W if and only if V ⪯_MCR W.

Proof. “If”: Suppose V ⪯_MCR W. Consider each view V ∈ V. Clearly V itself
is an MCR of the query V using V, since it is an equivalent rewriting of V.
Let MCR(V, W) be an MCR of V using W. Since V ⪯_MCR W, we have V ⊑
MCR(V, W). On the other hand, by the definition of MCR’s, MCR(V, W) ⊑
V. Thus MCR(V, W) and V are equivalent, and MCR(V, W) is an equivalent
rewriting of V using W. By Lemma 1, V ⪯_p W.

“Only if”: Suppose V ⪯_p W. By Lemma 1, every view in V has an equivalent
rewriting using W. For any query Q, let MCR(Q, V) and MCR(Q, W) be MCR’s
of Q using V and W, respectively. We replace each view in MCR(Q, V) with its
corresponding rewriting using W, and obtain a new rewriting MCR' of query Q
using W, which is equivalent to MCR(Q, V). By the definition of MCR’s, we have
MCR' ⊑ MCR(Q, W). Thus MCR(Q, V) ⊑ MCR(Q, W), and V ⪯_MCR W.

6 Related Work 

There has been a lot of work on the problem of selecting views to materialize
in a data warehouse. In [5,14,15,25], a data warehouse is modeled as a repository
of integrated information available for querying and analysis. A subset of queries
is materialized to improve responsiveness, and base relations are accessed for
the rest of the queries. Base relations can change over time, and the query
results can be huge, resulting in costs for maintenance and space. The study in
this setting has, therefore, emphasized modeling the view-selection problem
as a cost-benefit analysis. For a given set of queries, various sets of sub-queries are
considered for materialization. Redundant views in a set that increase costs are
deduced using query containment, and an optimal subset is chosen.

^ We extend the query-containment notation in Definition 1 to unions of
conjunctive queries in the obvious way.

Such a model is feasible when all queries can be answered in the worst case by 
accessing base relations, and not by views alone. This assumption is incorporated 
in the model by replicating base relations at the warehouse. Thus, the base 
relations themselves are considered to be normalized, independent, and minimal. 
However, when real-time access to base relations is prohibitive, such an approach 
can lead to wrong conclusions, as was seen in Example 1. In such a scenario, it 
is essential to ensure the computability of queries using only maintained views. 

Our work is directed towards scenarios where the following assumptions hold:
(1) real-time access to base relations is prohibitive, or possibly denied, and (2)
cached views are expensive to maintain over time, because of the high costs of
propagating changes from base relations to views. Therefore, while minimizing
a view set, it is important to retain its query-answering power. We believe the
power of answering queries and the benefit/costs of a view set are orthogonal
issues, and their interplay would make an interesting study in its own right.

The term “query-answering” has been used in [4] to mean deducing tuples
that satisfy a query, given the view definitions and their extensions. In our
framework, this term stands for the ability of a view set to answer queries. Another
related work is [13], which studies the information content of views. It develops a
concept of subsumption between two sets of queries, which is used to characterize their
capabilities of distinguishing two instances of a database.

Recently, [7] has proposed solutions to the following problem: given a set 
of queries on base relations, which views do we need to materialize in order to 
decrease the query answering time? The authors show that even for conjunctive 
queries and views only, there can be an infinite number of views that can answer 
the same query. At the same time, the authors show that the problem is decida- 
ble: for conjunctive queries and views, it is enough to consider a finite space of 
views where all views are superior, in terms of storage space and query answe- 
ring time, to any other views that could answer the given queries. The problem 
specification in that paper is different from ours: they start with a set of given 
queries and no views. In our framework, we assume that a set of views are given, 
and queries can be arbitrary. We would like to deduce a minimal subset of views 
that can answer all queries answerable by the original set. 

Currently we are working on some open problems in our framework, including
ways to find an optimal EMS of a view set efficiently, to find a v-equivalent
minimal subset of a view set efficiently, and to find cases where v-containment
can imply p-containment, and vice-versa.

Abstract. We consider the problem of data dissemination in a broadcast
network. In contrast to previously studied models, broadcasting is
among peers, rather than client server. Such a model represents, for ex- 
ample, satellite communication among widely distributed nodes, sensor 
networks, and mobile ad-hoc networks. We introduce a cost model for 
data dissemination in peer to peer broadcast networks. The model quan- 
tifies the tradeoff between the inconsistency of the data, and its transmis- 
sion cost; the transmission cost may be given in terms of dollars, energy, 
or bandwidth. Using the model we first determine the parameters for 
which eager (i.e. consistent) replication has a lower cost than lazy (i.e. 
inconsistent) replication. Then we introduce a lazy broadcast policy and 
compare it with several naive or traditional approaches to solving the 
problem. 



1 Introduction 

A mobile computing problem that has generated a significant amount of interest 
in the database community is data broadcasting (see for example [19]). The 
problem is how to organize the pages in a broadcast from a server to a large 
client population in the dissemination of public information (e.g. electronic news 
services, stock-price information, etc.). A strongly related problem is how to 
replicate (or cache) the broadcast data in the Mobile Units that receive the 
broadcast. 

In this paper we study the problems of broadcasting and replication in a 
peer to peer rather than client server architecture. More precisely, we study the 
problem of dissemination, i.e. full replication at all the nodes in the system. This 
architecture is motivated by new types of emerging wireless broadcast networks 
such as Mobile Ad-hoc Networks (see [6])^, sensor and “smart dust” networks
([12]), and satellite networks. These networks enable novel applications in which 

* This research was supported in part by Army Research Labs grant DAAL01-96-2-0003,
DARPA grant N66001-97-2-8901, NSF grants CCR-9816633, CCR-9803974,
IRI-9712967, EIA-0000516, and INT-9812325. 

^ A Mobile Ad-hoc Network (MANET) is a system of mobile computers (or nodes)
equipped with wireless broadcast transmitters and receivers which are used for com-
the nodes of a network collaborate to assemble a complete database. For instance,
in the case of sensors that are parachuted or sprayed from an airplane,
the database renders a global picture of an unknown terrain from local images 
collected by individual sensors. Or, the database consists of the current location 
of each member in a military unit (in a MANET case), or another meaningful 
database constructed from a set of widely distributed fragments. 

We model such applications using a “master” replication environment (see
[10]), in which each node i “owns” the master copy of a data item Di, i.e. it
generates all the updates to Di. For example, Di may be the latest in a sequence 
of images taken periodically by the node i of its local surroundings. Each new 
image updates Di. Or, Di may be the location of the node which is moving; 
Di is updated when the Global Positioning System (GPS) on board the node i 
indicates a current location that deviates from Di by more than a prespecified 
threshold. The database of interest is D = {D1,...,Dn}, where n is the number
of nodes and also the number of items in the database.^ 

It is required that D is accessible from each node in the network,^ thus 
each node stores a (possibly inconsistent) copy of D. ^ Our paper deals with 
various policies of broadcasting updates of the data items. In each broadcast 
a data item is associated with its version number, and a node that receives a 
broadcasted data item updates its local database if and only if the local version 
is older than the newly arrived version. In the broadcast policies there is a 
tradeoff between data consistency and communication cost. In satellite networks 
the communication cost is in terms of actual dollars the customer is charged by 
the network provider; in sensor networks, due to the small size of the battery, the 
communication cost is in terms of energy consumption for message transmission; 
and in MANETs the critical cost component is bandwidth (see [6]). Bandwidth
for (secure) communication is an important and scarce resource, particularly in 
military applications (see [17]). 
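
The version rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Versioned` record and the `receive` function are hypothetical names, and treating an item that is absent from the local database as older than any incoming version is an assumption made here.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Versioned:
    version: int  # version number assigned by the owner of the item
    value: Any    # payload, e.g. a location or an image


def receive(local_db: Dict[int, Versioned], item_id: int, incoming: Versioned) -> bool:
    """Overwrite the local copy if and only if it is older than the incoming one.

    Returns True when the local database was changed.
    Assumption: an item missing from local_db counts as older than anything.
    """
    current = local_db.get(item_id)
    if current is None or current.version < incoming.version:
        local_db[item_id] = incoming
        return True
    return False
```

Because the comparison is strict, receiving the same version twice leaves the local database unchanged, so repeated broadcasts are harmless.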

Now let us discuss the broadcast policies. One obvious policy is the following: 
for each node i, when Di is updated, node i broadcasts the new version of 
Di to the other nodes in the network. We call this the Single-item Broadcast 

municating within the system. Such networks provide an attractive and inexpensive 
alternative to the cellular infrastructures when this infrastructure is unavailable (e.g. 
in remote and disaster areas), or inefficient, or too expensive to use. Mobile Ad-hoc 
Networks are used to communicate among the nodes of a military unit, in rescue 
and disaster relief operations, in collaborative mobile data exchange (e.g. the set of
attendees at a conference), and other "micronetworking" technologies ([14]).

^ In case Di is the location of i, the database D is of interest in what are called 
Moving Objects Database (MOD) applications (see [13]). If Di is the location of 
object i in a battlefield situation, then a typical query may be: retrieve the friendly 
helicopters that are in a given region. Other MOD applications involve emergency 
(fire, police) vehicles and local transportation systems (e.g. city bus system).

® For example, the location of the members of a platoon should be viewable by any 
member at any time. 

By inconsistency of D we mean that some data items may not contain the most 
recent version. 

Dissemination (SBD) policy. In the networks and applications we discuss in
this paper, nodes may be disconnected, turned off or out of battery. Thus the 
broadcast of Di may not be received by all the nodes in the system. A natural 
way to deal with this problem is to rebroadcast an update to Di until it is 
acknowledged by all the nodes, i.e. Reliable Broadcast Dissemination (RBD). 
Clearly, if the new version is not much different than the previous one and if the 
probability of reception is low (thus necessitating multiple broadcasts), then this 
increase in communication cost is not justified. An alternative option, which we 
adopt in SBD, is to broadcast each update once, and let copies diverge. Thus 
the delivery of updates is unreliable, and consequently the dissemination of Di 
is "lazy" in the sense that the copy of Di stored at a node may be inconsistent.

How can we quantify the tradeoff between the increase in consistency afforded 
by a reliable broadcast and its increase in communication cost? In order to 
answer this question we introduce the concept of inconsistency-cost of a data 
item. This concept, in turn, is quantified via the notion of the cost difference 
between two versions of a data item Di. In other words, the inconsistency cost 
of using an older version v rather than the latest version w is the distance 
between the two versions. For example, if Di represents a location, then the cost 
difference between two versions of Di can be taken to be the distance between 
the two locations. If Di is an image, an existing algorithm that quantifies the 
difference between two images can be used (see for example [5]). If Di is the 
quantity-on-hand of a widget, then the difference between the two versions is 
the difference between the quantities. Now, in order to quantify the tradeoff 
between inconsistency and communication one has to answer the question: what 
amount of bandwidth/energy/dollars am I willing to spend in order to reduce 
the inconsistency cost on a data item by one unit? Using this model we establish 
the cost formulas for RBD and SBD, i.e. reliable and unreliable broadcasting,
and based on them formulas for selecting one of the two policies for a given set 
of system parameters. 

For the cases when unreliable broadcast, particularly SBD, is more appro- 
priate, consistency of the local databases can be enhanced by a policy that we 
call Full Broadcast Dissemination (FBD). In FBD, whenever Di is updated, i 
broadcasts its local copy of the whole database D, called D(i). In other words, i
broadcasts Di, as well as its local version of each one of the other data items in
the database. When a node j receives this broadcast, j updates its version of Di,
and j also updates its local copy of each other item Dk, for which the version
number in D(i) is more recent. Thus these indirect broadcasts of Dk (to j via
i) are "gossip" messages that increase the consistency of each local database.
However, again, this comes at the price of an increase in communication cost 
due to the fact that each broadcast message is n times longer. 
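
The receiving side of FBD can be sketched as a merge of two databases keyed by item. This is an illustrative sketch only: the `Database` type alias and the function name `merge_full_broadcast` are hypothetical, and items are represented as `(version number, value)` pairs for simplicity.

```python
from typing import Any, Dict, Tuple

# Hypothetical representation: item id -> (version number, value)
Database = Dict[int, Tuple[int, Any]]


def merge_full_broadcast(local_db: Database, received_db: Database) -> int:
    """Merge a received copy D(i) into the receiver's local database.

    Every item whose received version number is more recent replaces the
    local copy. Returns the number of items that were refreshed.
    """
    refreshed = 0
    for item_id, (version, value) in received_db.items():
        local = local_db.get(item_id)
        if local is None or local[0] < version:
            local_db[item_id] = (version, value)
            refreshed += 1
    return refreshed
```

Items for which the receiver already holds a newer version are left alone, which is how a node can end up fresher than the sender on some items and staler on others.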

The SBD and FBD policies represent in some sense two extreme solutions on 
a consistency-communication spectrum of lazy dissemination policies. SBD has 
minimum communication cost and minimum local database consistency, whereas 
FBD has maximum communication cost and maximum (under the imperfect 
circumstances) local database consistency. 

In this paper we introduce and analyze the Adaptive Broadcast Dissemination
(ABD) policy that optimizes the tradeoff between consistency and communication
using a cost based approach. In the ABD policy, when node i receives an
update to Di it first determines whether the expected reduction in inconsistency
justifies broadcasting a message. If so, then i "pads" the broadcast message that
contains Di with a set S of data items (that i does not own) from its local 
database, such as to optimize the total cost. One problem that we solve in this 
paper is how to determine the set S, i.e. how node i should select for each bro- 
adcast message which data items from the local database to piggyback on Di. 
In order to do so, i estimates for each j and k the expected benefit (in terms 
of inconsistency reduction) to node k of including in the broadcast message its 
local version of Dj. 

Let us now put this paper in the context of existing work on consistency in 
distributed systems. Our approach is new as far as we know. Although gossiping
has been studied extensively in distributed systems and databases (see for
example [3,8]), none of the existing works uses an inconsistency-communication
tradeoff cost function in order to determine what gossip messages to send.
Furthermore, in the emerging resource constrained environments (e.g. sensor
networks, satellite communication, and MANETs) this tradeoff is crucial. Also our
notion of consistency is appropriate for the types of novel applications discussed 
in this paper, and is different than the traditional notion of consistency in dis- 
tributed systems discussed in the literature (e.g., [3]). Specifically, in contrast to 
the traditional approaches, our notion of consistency does not mean consistency 
of different copies of a data item at different nodes, and it does not mean mutual 
consistency of different data items at a node. In this paper a copy of a data item 
at a node is consistent if it has the latest version of the data item. Otherwise 
it is inconsistent, and the inconsistency cost is the distance between the local 
copy and the latest version of the data item. Inconsistency of a local database is 
simply the sum of the inconsistencies of all data items. We employ gossiping to 
reduce inconsistency, not to ensure consistency as in using vector clocks ([3]). 

In this paper we provide a comparative analysis of dissemination policies. 
The analysis is probabilistic and experimental, and it achieves the following 
objectives. First, it gives a formula for the expected total cost of SBD and 
RBD, and a complete characterization of the parameters for which each policy 
has a cost lower than the other. Second, for ABD we prove cost optimality 
for the set of data items broadcast by a node i, for i’s level of knowledge of the 
system state. Third, the analysis compares the three unreliable policies discussed 
above, namely SBD, FBD, and ABD, and a fourth traditional one called flooding 
(FLD)® [18]. ABD proved to consistently outperform the other policies,
often having a total cost (that includes the cost of inconsistency and the cost of 
communication) that is several times lower than that of the other policies. Due 
to space limitations, the comparison of the policies is omitted from this paper. 

In summary, the key contributions of this paper are as follows. 

® In flooding a node i broadcasts each new data item it receives either as a result of
a local update of Di, or from a broadcast message. 

— Introduction of a cost model to quantify the tradeoff between consistency
and communication. 

— Analyzing the performance of eager and lazy dissemination via reliable and 
unreliable broadcasts respectively, obtaining cost formulas for each case and 
determining the data and communication parameters for which eager is su- 
perior to lazy, and vice versa. 

— Developing and analyzing the Adaptive Broadcast Dissemination policy, and 
comparing it to the other lazy dissemination policies. 

The rest of the paper is organized as follows. In section 2 we introduce the 
operational model and the cost model. In section 3 we analyze and compare 
reliable and unreliable broadcasting. In section 4 we describe the ABD policy, 
and in section 5 we analyze it. In section 6 we discuss relevant work. 



2 The Model 

In subsection 2.1 we precisely define the overall operational model, and in subs- 
ection 2.2 we define the cost model. 



2.1 Operational Model 

The system consists of a set of n nodes that communicate by message broadcasting.
Each node i (1 ≤ i ≤ n) has a data item Di associated with it. Node
i is called Di’s owner. This data item may contain a single numeric value, or a 
complex data structure such as a motion plan, or an image of the local environ- 
ment. Only i, and no other nodes, has the authorization to modify the state of 
Di. A data item is updated at discrete time points. Each update creates a new 
version of the data item. In other words, the kth version of Di, denoted Di(k),
is generated by the kth update. We denote the latest version of Di by Di.
Furthermore, we use v(Di) to represent the version number of Di, i.e. v(Di(k)) = k.
For two versions Di(k) and Di(k'), we say that Di(k) is newer than Di(k') if
k > k', and Di(k) is older than Di(k') if k < k'.

An owner i periodically broadcasts its data item Di to the rest of the system. 
Each such broadcast includes the version number of Di. Since nodes may be 
disconnected, some broadcasts may be missed by some nodes, thus each node j 
has a version of each Di which may be older than Di. The local database of node 
i at any given time is the set <D_1^i, D_2^i, ..., D_n^i>, where each D_j^i (for 1 ≤ j ≤ n)
is a version of Dj. Observe that since all the updates of Di originate at i,
D_i^i = Di. Node i updates D_j^i (j ≠ i) in its local database when it receives a
broadcast from j. 

Nodes may be disconnected (e.g. shut down) and thus miss messages. Let pi 
be the percentage of time a node i is connected. Then pi is also the probability 
that i receives a message from any other node j. For example, if i is connected 
60% of the time (i.e. pi = 0.6), then a message from j is received by i with 
probability 0.6. We call pi the connection probability of i. 

2.2 Cost Model

In this subsection we introduce a cost function that quantifies the tradeoff bet- 
ween consistency and communication. The function has two purposes. First, to 
enable determining the items that will be included in each broadcast of the ABD 
policy, and second, to enable comparing the various policies. 

Inconsistency cost. Assume that the distance between any two versions of 
a data item can be quantified. For example, in moving objects database (MOD) 
applications, the distance between two data item versions may be taken to be 
the Euclidean distance between the two locations. If Di is an image, one of 
the many existing distance functions between images (e.g. the cross-correlation 
distance ([5])) can be used. 

Formally, the distance between two versions Di(k) and Di(j), denoted
DIST(Di(k), Di(j)), is a function whose range is the nonnegative reals, and
it has the property that the distance between two identical versions is 0. If the
data item owned by each node consists of two or more types of logical objects,
each with its own distance function, then the distance between the items should
be taken to be the weighted average of the pairwise distances.

We take the DIST function to represent the cost, or the penalty, of using the
older version rather than the newer one. More precisely, consider two consecutive
updates on Di, namely the kth update and the (k+1)st update. Assume that
the kth update happened at time t_k and the (k+1)st update at time t_{k+1}.
Intuitively, at time t_{k+1} each node j that did not receive the kth version Di(k)
during the interval [t_k, t_{k+1}) pays a price which is equal to the distance between
the latest version of Di that j knows and Di(k). In other words, this price is the
penalty that j pays for using an older version during the time in which j should
have used Di(k). If j receives Di(k) sometime during the interval [t_k, t_{k+1}), then
the price that j pays on Di is zero. Formally, assume that at time t_{k+1} the latest
version of Di that j knows is v (v ≤ k). Then j's inconsistency cost on version
k of Di is COST_INCO_j(Di(k)) = DIST(Di(v), Di(k)).

The inconsistency cost of the system on Di(k) is

\[ COST\_INCO(D_i(k)) = \sum_{1 \le j \le n} COST\_INCO_j(D_i(k)). \]

The total inconsistency cost of the system on Di up to the mth update of Di,
denoted COST_INCO(i,m), is

\[ COST\_INCO(i,m) = \sum_{1 \le k \le m} COST\_INCO(D_i(k)). \]

The total inconsistency cost for the system up to time t is

\[ COST\_INCO(t) = \sum_{1 \le i \le n} COST\_INCO(i, m_i), \]

where m_i is the highest version number of Di at time t.
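
The inconsistency-cost definitions can be made concrete with a short sketch. Everything named here is hypothetical: versions are stored in a list whose index is the version number (with index 0 holding an assumed initial version), and `known` maps each node j to the latest version number it holds when the cost is charged.

```python
from typing import Callable, Dict, List

Distance = Callable[[float, float], float]


def cost_on_version(versions: List[float], k: int,
                    known: Dict[int, int], dist: Distance) -> float:
    """System inconsistency cost on version k of one data item.

    versions[v] is the value of version v; known[j] is the latest version
    number (at most k) that node j holds at the time of the next update.
    A node that holds version k itself contributes dist(x, x) = 0.
    """
    return sum(dist(versions[v], versions[k]) for v in known.values())


def total_cost_for_item(versions: List[float],
                        known_per_version: Dict[int, Dict[int, int]],
                        dist: Distance) -> float:
    """Total inconsistency cost on one item: the sum over its versions k."""
    return sum(cost_on_version(versions, k, known, dist)
               for k, known in known_per_version.items())
```

For numeric items a natural choice of `dist` is the absolute difference; for locations it would be the Euclidean distance mentioned above.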

Communication cost. The cost of a message depends on the length of the
message. In particular, if there are m data items in a message, the cost of the
message is C1 + m · C2. C1 is called the message initiation cost and C2 is called
the message unit cost. C1 represents the cost of energy consumed by the CPU
to prepare and send the message. C2 represents the incremental cost of adding a
data item to a message. The values of C1 and C2 are given in inconsistency cost
units. They are determined based on the amount of resource that one is willing
to spend in order to reduce the inconsistency cost on a version by one unit. For
example, if C1 = C2 and one is willing to spend one message of one data item
in order to reduce the inconsistency by at least 50, then C1 = C2 = 1/100.

The total communication cost up to time t is the sum of the costs of all the 
messages that have been broadcast from the beginning (time 0) until t. 

System cost. The system cost up to time t, denoted COST_SYS(t), is the
sum of the total inconsistency cost for the system up to t and the total
communication cost up to t. The system cost is the objective function optimized by the
ABD policy.
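
As a small illustration of how the two cost components combine, the sketch below evaluates the message cost and the system cost for a given history. The function names and the representation of the history as a list of message sizes are assumptions made for this example.

```python
from typing import Iterable


def message_cost(num_items: int, c1: float, c2: float) -> float:
    """Cost of one message carrying num_items data items: C1 + m * C2."""
    return c1 + num_items * c2


def system_cost(inconsistency_cost: float, message_sizes: Iterable[int],
                c1: float, c2: float) -> float:
    """System cost: total inconsistency cost plus total communication cost.

    message_sizes lists the number of data items in each message that was
    broadcast up to the time of interest.
    """
    communication = sum(message_cost(m, c1, c2) for m in message_sizes)
    return inconsistency_cost + communication
```

Since C1 and C2 are expressed in inconsistency cost units, the two terms can be added directly without any further conversion.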



3 Reliable Versus Unreliable Broadcasting 



In this section we completely characterize the cases in which lazy dissemination 
by unreliable broadcasting outperforms eager dissemination by reliable broad- 
casting, and vice versa. Lazy dissemination is executed by the Single-item Bro- 
adcast Dissemination policy, in which each node i unreliably broadcasts each 
update it receives, when i receives it. Eager dissemination is executed by the 
Reliable Broadcast Dissemination (RBD) policy, in which each node i reliably 
broadcasts each update it receives, when i receives it; by reliable broadcast we 
mean that i retransmits the message until it is acknowledged by all the other 
nodes. Performance of the two policies is measured in terms of the system cost, 
as defined at the end of the previous section. We first derive the closed formu- 
las for the system costs of SBD and RBD. Then, based on these formulas, we 
compare SBD and RBD. 

In the following discussion, we assume that for each node i, the updates at
i are generated by a Poisson process with intensity λ_i. Let λ = Σ_{1≤i≤n} λ_i. The
number of nodes in the system is n, the connection probability of each node
i is p_i, the message initiation cost is C1, and the message unit cost is C2.

The following theorem gives the system cost of SBD up to a given point in 
time. 

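Theorem 1 The system cost of SBD up to time t (i.e. COST_SYS_SBD(t)) is a
random variable whose expected value is

\[
E[COST\_SYS_{SBD}(t)] = \lambda \cdot t \cdot (C_1 + C_2)
 + \sum_{1 \le i \le n} \sum_{m=1}^{\infty} \Bigl( \frac{e^{-\lambda_i t} \cdot (\lambda_i \cdot t)^m}{m!}
 \cdot \sum_{q=1}^{m-1} \sum_{\substack{1 \le j \le n \\ j \ne i}}
 \bigl( \ldots + \sum_{k=1}^{q-1} p_j \cdot (1 - p_j)^{q-k} \cdot DIST(D_i(k), D_i(q)) \bigr) \Bigr)
\]

□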
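
The communication term of Theorem 1 can be checked directly. Under SBD each update triggers exactly one broadcast carrying one data item, so each update costs C1 + C2. Writing N_i(t) for the number of updates at node i by time t (a symbol introduced here for illustration), the Poisson assumption gives E[N_i(t)] = λ_i · t, and therefore:

```latex
\[
E[\text{communication cost up to } t]
  = \sum_{1 \le i \le n} E[N_i(t)] \cdot (C_1 + C_2)
  = (C_1 + C_2) \cdot t \cdot \sum_{1 \le i \le n} \lambda_i
  = \lambda \cdot t \cdot (C_1 + C_2).
\]
```

The remaining term of the theorem is the expected inconsistency cost, which depends on the connection probabilities and on the distances between versions.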

Now we analyze the system cost of the reliable broadcast dissemination 
(RBD) policy. First let us introduce a lemma which gives the expected num- 
ber of times that a message is transmitted from node i (remember that in RBD 
a message is retransmitted until it is acknowledged by all the other nodes). 

Lemma 1 Let R_i be the number of times that a message is transmitted. Then
R_i is a random variable whose expected value is:

\[
E[R_i] = \sum_{k=1}^{\infty} k \cdot \Bigl( \prod_{j \ne i} \bigl(1 - (1 - p_j)^k\bigr) - \prod_{j \ne i} \bigl(1 - (1 - p_j)^{k-1}\bigr) \Bigr)
\]

□
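
The series in Lemma 1 can be evaluated numerically. The sketch below assumes that receptions are independent across nodes and transmissions, so that the probability that all other nodes have received the message within k transmissions is the product of 1 − (1 − p_j)^k over the other nodes j. The function name, the tolerance, and the truncation rule are choices made for this illustration.

```python
from math import prod
from typing import Sequence


def expected_transmissions(p_others: Sequence[float],
                           tol: float = 1e-12,
                           max_terms: int = 1_000_000) -> float:
    """Numerically evaluate E[R_i] = sum_k k * (F(k) - F(k-1)).

    F(k) = prod_j (1 - (1 - p_j)**k) is the probability that every other
    node j has received the message within k transmissions.
    p_others holds the connection probabilities of the other nodes; each
    must be strictly positive for the series to converge.
    """
    def cdf(k: int) -> float:
        return prod(1.0 - (1.0 - p) ** k for p in p_others)

    total = 0.0
    previous = cdf(0)
    for k in range(1, max_terms + 1):
        current = cdf(k)
        total += k * (current - previous)
        previous = current
        # Stop once the remaining tail is negligible.
        if k * (1.0 - current) < tol:
            break
    return total
```

With a single receiver the result reduces to the mean of a geometric distribution, 1/p, which gives a quick sanity check on the formula.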

Theorem 2 The system cost of RBD up to time t (i.e. COST_SYS_RBD(t)) is
a random variable whose expected value is:

\[
E[COST\_SYS_{RBD}(t)] = (C_1 + C_2) \cdot t \cdot \sum_{i=1}^{n} (\lambda_i \cdot E[R_i]) + (n - 1) \cdot C_1 \cdot \lambda \cdot t
\]

(the value of E[R_i] was derived in Lemma 1) □
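
The formula of Theorem 2 is straightforward to evaluate once the update intensities and the expected numbers of transmissions are known. In the sketch below the function and parameter names are hypothetical; reading the second term as the cost of the acknowledgments (n − 1 messages without data items per update) is an interpretation, and the code only evaluates the stated expression.

```python
from typing import Sequence


def rbd_expected_cost(t: float, rates: Sequence[float],
                      expected_retx: Sequence[float],
                      c1: float, c2: float) -> float:
    """Evaluate the expected RBD system cost up to time t.

    rates[i] is the update intensity of node i and expected_retx[i] is
    E[R_i] from Lemma 1. The two sequences must have the same length n.
    """
    n = len(rates)
    lam = sum(rates)
    # Data messages: each update at i is transmitted E[R_i] times on average.
    data_term = (c1 + c2) * t * sum(l * r for l, r in zip(rates, expected_retx))
    # Second term of the theorem (interpreted here as acknowledgment cost).
    second_term = (n - 1) * c1 * lam * t
    return data_term + second_term
```

There is no inconsistency term in this expression, which matches the description of RBD as retransmitting until every node has acknowledged the update.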

Based on Theorems 1 and 2, we identify the situations in which SBD outper- 
forms RBD, and vice versa. But due to space limitations, this result is omitted. 



4 The Adaptive Broadcast Dissemination Policy 

In this section we describe the Adaptive Broadcast Dissemination policy. Intui- 
tively, a node i executing the policy behaves as follows. When it receives an 
update to Di, node i constructs a broadcast message by evaluating the benefit 
of including in the message each one of the data items in its local database. 
Specifically, the ABD policy executed by i consists of the following two steps. 

(1) Benefit estimation: For each data item in the local database, estimate 
how much the inconsistency of the system could be reduced if that data item is 
included in the message. 

(2) Message construction: Construct the message which is a subset of the local 
database so that the total estimated net benefit of the message is maximized (the
net benefit is the difference between the inconsistency reduced by the message
and the cost of the message). Observe that the set of data items to be broadcast
may be empty. In other words, when Di is updated, node i may estimate that 
the net benefit of broadcasting any data item is negative. 

Each one of the above steps is executed by an algorithm which is described 
in one of the next two subsections. 



4.1 Benefit Estimation 

Intuitively, the benefit to the system of including a data item Dj in a message 
that node i broadcasts is in terms of inconsistency reduction. This reduction 
depends on the nodes that receive the broadcast, and on the latest version of Dj 
at each one of these nodes. Node i maintains data structures that enable it to 
estimate the latest version of Dj at each node. Then the benefit of including a 







B. Xu, O. Wolfson, and S. Chamberlain 



data item Dj in a message that i broadcasts is simply the sum of the expected 
inconsistency reductions at all the nodes. 

In computing the inconsistency reduction for a node k we attempt to be as
accurate as possible, and we do so as follows. Node i maintains a "knowledge
matrix" which stores in entry (k,j) the last version number of Dj that node i
received from node k (this version is called v(D_j^k)), and the time when it was
received. Additionally, i saves in the "real history" for each Dj all the versions
of Dj that i has "heard" from other nodes, the times at which it has done so,
and from which node they were received®. The reason for maintaining all this
information is that now, in estimating which version of Dj node k has, node i
can take into consideration two factors: (1) the last version of Dj that i received
from k at time, say t, and (2) the fact that since time t node k may have received
updates of Dj by "third party" messages that were transmitted after time t, and
"heard" by both k and i. Node i also saves with each version v of Dj that
it "heard", the distance (i.e. the inconsistency caused by the version difference)
between v and the last version of Dj that i knows; this difference is the parameter
necessary in order to compute the inconsistency cost reduction that is obtained
if node i broadcasts its latest version of Dj.

In subsection 4.1.1 we describe the data structures that are used by a node i in
benefit estimation. In subsection 4.1.2 we present i’s benefit estimation method. 



4.1.1 Data Structures 

(1) The Knowledge matrix: For each data item Dj (j ≠ i), denote by v(D_j^k)
the latest version number of Dj that i received from k, and denote by t(D_j^k) the
last time when D_j^k was received at i. The knowledge matrix at node i is:

\[
M_i = \begin{pmatrix}
(t(D_1^1), v(D_1^1)) & (t(D_2^1), v(D_2^1)) & \cdots & (t(D_n^1), v(D_n^1)) \\
(t(D_1^2), v(D_1^2)) & (t(D_2^2), v(D_2^2)) & \cdots & (t(D_n^2), v(D_n^2)) \\
\vdots & \vdots & & \vdots \\
(t(D_1^n), v(D_1^n)) & (t(D_2^n), v(D_2^n)) & \cdots & (t(D_n^n), v(D_n^n))
\end{pmatrix}
\]



Node i updates the matrix whenever it receives a message. Specifically, when 
i receives a message from k that includes Dj, i updates the entry (k,j) of the 
matrix. In addition, if the version of Dj received is newer than the version in i’s 
local database, then the newer version updates Dj in the local database. 
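
The update rule for the knowledge matrix can be sketched as follows. The `NodeState` class, its field names, and the use of a dictionary keyed by (k, j) instead of a dense matrix are all assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class NodeState:
    # knowledge[(k, j)] = (t, v): time and version number of Dj as last
    # received from node k (a sparse stand-in for the knowledge matrix).
    knowledge: Dict[Tuple[int, int], Tuple[float, int]] = field(default_factory=dict)
    # local_db[j] = (version number, value) of this node's copy of Dj.
    local_db: Dict[int, Tuple[int, Any]] = field(default_factory=dict)

    def on_receive(self, sender: int, now: float,
                   items: Dict[int, Tuple[int, Any]]) -> None:
        """Process a message from `sender` that arrived at time `now`."""
        for j, (version, value) in items.items():
            # Always record what the sender is known to hold.
            self.knowledge[(sender, j)] = (now, version)
            # Refresh the local copy only if the received version is newer.
            if j not in self.local_db or self.local_db[j][0] < version:
                self.local_db[j] = (version, value)
```

Note that the matrix entry is updated even when the received version is stale: the entry describes what the sender holds, not what the receiver holds.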

(2) Version sequence: A version sequence records all the version numbers that
i has ever known about a data item. Due to unreliability, it is possible that i has
not received all the versions of a data item. In particular, the version sequence
of Dj is VSj = <v_1, v_2, ..., v_h> where v_1 < v_2 < ... < v_h are all the version
numbers that i has ever known about Dj. For each v ∈ VSj, i saves the
distance between Dj(v) and Dj(v_h).

(3) Dissemination history: For each version number v in each VSj, i maintains
a dissemination history DHj(v). This history records every time point at which



There is a potential storage problem here, which we address, but we postpone the 
discussion for now. 
i received Dj(v) from a node. DHj(v) also contains every time point at which i
broadcast Dj(v).

Now we discuss how we limit the amount of storage used. Observe that
the length of each version sequence VSj and dissemination history DHj(v)
increases unboundedly as i receives more broadcasts. This presents a storage
problem. A straightforward solution to this problem is to limit the length of
each version sequence to α and the length of each dissemination history to β.
We call this variant of the ABD policy ABD(α, β). The drawback of ABD(α, β) is
that when the length of a dissemination history DHj(v) is smaller than β, since
each dissemination history is limited to β, other dissemination histories cannot
make use of the free storage of DHj(v). A better solution, which we adopt in
this paper, is to limit the sum of the lengths of each dissemination history in
each version sequence. In particular, we use ABD-s to denote the ABD policy
in which Σ_{1≤j≤n} Σ_{v∈VSj} |DHj(v)| is limited to s. s must be at least n.



4.1.2 The Benefit Estimation Method 

When an update on Di occurs, node i estimates the benefit of including its
latest version of Dj in the broadcast message, for each Dj in the local database.
Intuitively, i does so using the following procedure. For each node k compute the
set of versions of Dj that k can have, i.e. the set of versions that were received at
i after D_j^k was received. Assume that there are m such versions. Then, compute
the set of broadcasts from which k could have learned each one of these versions.
Based on this set compute the probabilities q_1, q_2, ..., q_m that k has each one of
the possible versions v_1, v_2, ..., v_m. Finally, compute the expected benefit to k as
the sum q_1·DIST(v(Dj), v_1) + q_2·DIST(v(Dj), v_2) + ... + q_m·DIST(v(Dj), v_m).

Formally, node i performs the benefit estimation in five steps:

(1) Construct an effective version sequence (EVS) of D_j^k which is a subsequence
of VSj:

EVSj = { v | v ∈ VSj and v ≥ v(D_j^k) and there exists t ∈ DHj(v) such that t > t(D_j^k) }

Intuitively, EVSj is the set of versions of Dj that k can have, as far as i knows.
In other words, EVSj contains each version v that satisfies the following two
properties: (i) v is higher than or equal to the latest version of Dj that i has
received from k (i.e. v(D_j^k)), and (ii) i has received at least one broadcast which
includes Dj(v), and that broadcast arrived later than D_j^k.

(2) For each v in EVSj that is higher than v(D_j^k), count the effective dissemination
number which is the size of the set {t | t ∈ DHj(v) and t > t(D_j^k)}, and
denote this number EDN_j^k(v). Intuitively, EDN_j^k(v) is the number of broadcasts
from which k could have learned Dj(v), based on i's knowledge.

(3) For each v in EVSj, compute η_v which, as we will prove, is the probability
that the version number of Dj in k's local database is v. If v = v(D_j^k), η_v = … .
Otherwise, η_v = … (1 − p_k)^{EDN_j^k(v)} … .



|A| denotes the size of the set A.
(4) If the version number of Dj in k's local database is v, then the estimated
benefit to k of including D_j^i in the broadcast message is taken to be the distance
between Dj(v) and D_j^i (i.e. DIST(Dj(v), D_j^i)). Denote this benefit B(D_j^i, k, v).

(5) The estimated benefit to k of including D_j^i in the broadcast message is
taken to be … . Denote this benefit by B(D_j^i, k). Then
the estimated benefit B(D_j^i) of including D_j^i in the broadcast message is:

\[ B(D_j^i) = \sum_{1 \le k \le n} B(D_j^i, k) \qquad (1) \]
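
Steps (1) and (2) are set computations over the stored histories and can be sketched directly. The sketch below is illustrative only: the function names and argument layout are hypothetical, and the last function implements the intuitive probability-weighted sum from the beginning of this subsection, taking the probabilities as an input rather than deriving them.

```python
from typing import Dict, List


def effective_versions(vs_j: List[int], dh_j: Dict[int, List[float]],
                       v_k: int, t_k: float) -> List[int]:
    """Step (1): versions v >= v_k that were disseminated after time t_k.

    vs_j is the version sequence of Dj, dh_j[v] the recorded dissemination
    times of version v, and (t_k, v_k) the knowledge-matrix entry for k.
    """
    return [v for v in vs_j
            if v >= v_k and any(t > t_k for t in dh_j.get(v, []))]


def effective_dissemination_number(dh_j: Dict[int, List[float]],
                                   v: int, t_k: float) -> int:
    """Step (2): number of recorded disseminations of version v after t_k."""
    return sum(1 for t in dh_j.get(v, []) if t > t_k)


def expected_benefit_to_node(probabilities: Dict[int, float],
                             distance_to_local: Dict[int, float]) -> float:
    """Probability-weighted sum of distances over the candidate versions.

    probabilities[v] is the probability that k holds version v and
    distance_to_local[v] the stored distance between version v and the
    version that i would broadcast.
    """
    return sum(q * distance_to_local[v] for v, q in probabilities.items())
```

Keeping the probability computation separate means the same helper functions can be reused regardless of how those probabilities are estimated.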

4.2 Message Construction Step 

The objective of this step is for node i to select a subset S of data items from 
the local database for inclusion in the broadcast message. The set S is cho- 
sen such that the expected net benefit of the message (i.e. the total expected 
inconsistency-reduction benefit minus the cost of the message) is maximized. 

First, node i sorts the estimated benefits of the data items in descending
order. Thus we have the benefit sequence B(D_{k_1}^i) ≥ B(D_{k_2}^i) ≥ ... ≥ B(D_{k_n}^i).
Then i constructs the message as follows. If there is no number t between 1
and n such that the sum of the first t members in the sequence is bigger than
(C1 + t · C2), then i will not broadcast a message. Else, i finds the shortest prefix
of the benefit sequence such that the sum of all the members in the prefix is
greater than (C1 + m · C2), where m is the length of the prefix. i places the data
items corresponding to the prefix in the broadcast message. Then i considers
each member j that succeeds the prefix. If B(D_j^i) is greater than or equal to C2,
then i puts D_j^i in the message.
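
The message construction procedure above translates almost line by line into code. The following is a minimal sketch in which the function name and the representation of the estimated benefits as a dictionary from item id to benefit are assumptions.

```python
from typing import Dict, List


def construct_message(benefits: Dict[int, float], c1: float, c2: float) -> List[int]:
    """Select the ids of the data items to broadcast.

    benefits maps each item id j to its estimated benefit. Returns an empty
    list when no prefix of the sorted benefits pays for its own message.
    """
    # Sort by estimated benefit, largest first.
    ranked = sorted(benefits.items(), key=lambda kv: kv[1], reverse=True)

    # Find the shortest prefix whose total benefit exceeds C1 + m * C2.
    running = 0.0
    prefix_len = None
    for m, (_, benefit) in enumerate(ranked, start=1):
        running += benefit
        if running > c1 + m * c2:
            prefix_len = m
            break
    if prefix_len is None:
        return []

    # The prefix is included; each later item is added if it pays for
    # its own incremental cost C2.
    message = [j for j, _ in ranked[:prefix_len]]
    message += [j for j, benefit in ranked[prefix_len:] if benefit >= c2]
    return message
```

Once the initiation cost C1 has been covered by the prefix, every further item only needs to cover its own unit cost C2, which is why the test for the remaining items is a simple per-item comparison.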

In section 5 we show that the procedure in this step broadcasts the subset S 
of data items whose net benefit is higher than that of any other subset. 

This concludes the description of the ABD-s policy, which consists of the 
benefit estimation and message construction steps. It is easy to see that the 
time complexity of the policy is O(n · s).

5 Analysis of the ABD Algorithm 

In this section we prove cost optimality of ABD based on the level of knowledge 
that node i has about the other nodes in the system. The following definitions 
are used in the analysis. 

Definition 1 If at time t there is a broadcast from i which includes Dj, we say
that a dissemination of Dj occurs at time t, and denote it r_j(i,v,t) where v is
the version number of Dj included in that broadcast. □

Definition 2 A dissemination sequence of Dj at time t is the sequence of all
the disseminations of Dj that occurred from the beginning until time t:

RS_j(t) = <r_j(n_1,v_1,t_1), r_j(n_2,v_2,t_2), ..., r_j(n_m,v_m,t_m)>

where t_1 < t_2 < ... < t_m ≤ t. □

Definition 3 Suppose k receives a message from i which includes D_j^i. Denote
D_j^k the version of Dj in k's local database immediately before the broadcast. If
the version of D_j^i is higher than the version of D_j^k, then the actual benefit to k of
receiving D_j^i, denoted B(D_j^i, k), is:

\[ B(D_j^i, k) = DIST(D_j^k, \overline{D}_j) - DIST(D_j^i, \overline{D}_j), \]

where \overline{D}_j is the latest version of Dj. Otherwise the actual benefit is 0. □

In other words, the actual benefit to k of receiving D_j^i is the reduction in the
distance of D_j^k from the latest version of Dj. Observe that the actual benefit can
be negative. For example, consider the case where Dj is a numeric value and
DIST(D(k), D(k')) = |D(k) - D(k')|. If \overline{D}_j = 300, D_j^i = 100 and
D_j^k = 200, then B(D_j^i, k) = -100.
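
The numeric example can be reproduced with a small sketch. It assumes the reading of Definition 3 in which the actual benefit is the receiver's distance to the latest version minus the sender's distance to the latest version; the function name and the explicit version-number arguments are hypothetical.

```python
def actual_benefit(latest: float, sender_copy: float, receiver_copy: float,
                   sender_version: int, receiver_version: int) -> float:
    """Actual benefit to the receiver of one received numeric data item.

    The distance is the absolute difference of the values. The benefit is
    0 unless the sender's version number is higher than the receiver's.
    """
    if sender_version <= receiver_version:
        return 0.0
    receiver_distance = abs(receiver_copy - latest)
    sender_distance = abs(sender_copy - latest)
    return receiver_distance - sender_distance
```

Calling `actual_benefit(300, 100, 200, 5, 4)` gives -100, matching the example in the text: a higher version number does not guarantee a value that is closer to the latest one.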

Definition 4 The actual benefit of dissemination r_j(i,v,t), denoted B(D_j^i), is
the sum of the actual benefits to each node k that receives the message from i
at t which included D_j^i. The actual benefit of a broadcast message is the sum of
the actual benefits of each data item included in the message. □

Now we discuss two levels of knowledge of i about the other nodes in the 
system. 

Definition 5 Node i is absolutely reliable on Dj for node k by time t if i has
received all the broadcast messages which included Dj and were sent between
t(D_j^k) and t. i is absolutely reliable on Dj by time t if i is absolutely reliable on
Dj for each node k by t. i is absolutely reliable by time t if i is absolutely reliable
on each Dj by t. □

Definition 6 Node i is strictly synchronized with Dj at time t if at t Dj in i’s 
local database is the latest version of Dj at t. i is strictly synchronized at time 
t if i is strictly synchronized with each Dj at t. □ 

Obviously, if i is strictly synchronized at time t, then i’s local database is 
identical to the system state at t. 

Observe that if each node j broadcasts Dj whenever an update on Dj occurs, 
then a node i which is absolutely reliable on Dj by time t is strictly synchronized 
with Dj at time t. However, in the ABD policy a node j may decide not to bro- 
adcast the new version of Dj, and thus i is not necessarily strictly synchronized 
with Dj even if i is absolutely reliable on Dj. On the other hand, i can be strictly
synchronized even if it is not absolutely reliable. In other words, "absolutely
reliable" and "strictly synchronized" are two independent properties.

Theorem 3 Let RS_j(t) be a dissemination sequence of Dj in which the last
dissemination is r_j(i,v,t). The actual benefit of r_j(i,v,t) is a random
variable. If i is absolutely reliable on Dj by t and strictly synchronized with Dj
at t, then the estimated benefit B(D_j^i) given by the ABD policy (see Equality 1)
is the expected value of the actual benefit of r_j(i,v,t). □

Now we devise a function which allows us to measure the cost efficiency of a 
broadcast. 

Definition 7 The actual net benefit of a broadcast message is the difference
between the actual benefit of the message and the cost of the message. We denote
by NB(M) the actual net benefit of broadcasting a set of data items M. □

Definition 8 A broadcast sequence at time t is the sequence of all the broadcasts
in the system from the beginning (time 0) until time t:

BS(t) = ⟨M(n_1, t_1), M(n_2, t_2), ..., M(n_m, t_m)⟩    (2)

where M(n_i, t_i) is a message that is broadcast from n_i at time t_i, and
t_1 < t_2 < ... < t_m ≤ t. □
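
Definitions 4 and 7 can be illustrated with a small sketch. It assumes,
consistently with the numeric example given earlier, that the actual benefit to
a receiving node is the reduction in its distance to the latest version once it
adopts the received copy, and it uses the absolute distance |a - b|. The
function names and the cost value of 5 are illustrative assumptions and are not
part of the ABD policy.

```python
def dist(a, b):
    """Absolute distance between two numeric versions."""
    return abs(a - b)

def actual_benefit(latest, sent, held_by_receivers):
    """Sum, over receivers, of the reduction in distance to the latest version.

    Assumption: each receiver replaces its held version with the sent one.
    """
    return sum(dist(held, latest) - dist(sent, latest)
               for held in held_by_receivers)

def actual_net_benefit(benefit, cost):
    """Definition 7: actual benefit of the message minus its cost."""
    return benefit - cost

# Latest version is 300, the sender holds 100, the only receiver holds 200.
benefit = actual_benefit(latest=300, sent=100, held_by_receivers=[200])
print(benefit)                              # -100
print(actual_net_benefit(benefit, cost=5))  # -105 (the cost is hypothetical)
```

The negative result reflects the case discussed above: the receiver held a
version closer to the latest one than the version it was sent.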

For a node which is both absolutely reliable by t and strictly synchronized at
t, we have the following theorem concerning the optimality of the ABD policy.

Theorem 4 Let BS(t) be a broadcast sequence in which the last broadcast
is M(i, t). The actual net benefit of broadcast M(i, t) (i.e. NB(M(i, t))) is a
random variable. In particular, let M be the set of data items broadcast by the
ABD policy at time t. If i is absolutely reliable by t and strictly synchronized
at t, then:

(1) E[NB(M)] ≥ 0

(2) For any M' which is a subset of i's local database, E[NB(M')] ≤
E[NB(M)]. □

Theorem 4 shows that the message broadcast by the ABD policy is optimized 
because the expected net benefit of broadcasting any subset of i’s local database 
is not higher than that of broadcasting this message. Granted, this theorem holds 
under the assumption of strict synchronization and absolute reliability, but i can 
base its decision only on the information it knows. 

In some cases, Theorems 3 and 4 hold for a node which is not strictly
synchronized.

Consider a data item Di which is a single numeric value that increases
monotonically as the version number of Di increases. We call this a monotonous
data item. Assume that the distance function is: DIST(Di(k), Di(k')) =
|Di(k) - Di(k')|. We call this the absolute distance function.

For monotonous data items and absolute distance functions, Theorems 3 and 
4 are true when i is absolutely reliable but not necessarily strictly synchronized 
at t. Thus we have the following two theorems. 

Theorem 5 Let RSj(t) be a dissemination sequence where the last dissemination
is rj(i, v, t). The actual benefit of rj(i, v, t) (i.e. B(Dj)) is a random variable.
For monotonous data items and absolute distance functions, if i is absolutely
reliable on Dj by t, then the benefit given by the ABD policy (see Equality 1)
is the expected value of B(Dj). □

Theorem 6 Let BS(t) be a broadcast sequence, where the last broadcast is
M(i, t). The actual net benefit of broadcast M(i, t) (i.e. NB(M(i, t))) is a
random variable. In particular, let M be the message broadcast by the ABD
policy at time t. For monotonous data items and absolute distance functions,
if i is absolutely reliable by t, then:

(1) E[NB(M)] ≥ 0

(2) For any M' which is a subset of i's local database, E[NB(M')] ≤
E[NB(M)]. □

6 Relevant Work

The problem of data dissemination in peer to peer broadcast networks has not 
been analyzed previously as far as we know. The data broadcasting problem 
studied in [11,20] is how to organize the broadcast and the cache in order to 
reduce the response time. The above works assume a centralized system with 
a single server and multiple clients communicating over a reliable network with 
large bandwidth. In contrast, in our environment these assumptions about the 
network do not always hold, and the environment is totally distributed and each 
node is both a client and a server. 

Pagani et al. ([9]) proposed a reliable broadcast protocol which provides an
exactly once message delivery semantics and tolerates host mobility and
communication failures. Birman et al. ([4]) proposed three multicast protocols for
transmitting a message reliably from a sender process to some set of destination
processes. Unlike these works, we consider a "best effort" reliability model and
allow copies to diverge.

Lazy replication by gossiping has been extensively investigated in the past
(see for example [2]). Epidemic algorithms ([16]) such as the one used in
Grapevine ([15]) also propagate updates by gossiping. However, there are two major
differences between our work and the existing works. First, none of these works
considered the cost of communication; this cost is important in the types of
novel applications considered in this paper. Second, we consider the tradeoff
between communication and inconsistency, whereas the existing works do not.
Alonso, Barbara, and Garcia-Molina ([1]) studied the tradeoff between the gains
in query response time obtained from quasi-caching, and the cost of checking
coherency conditions. However, they assumed point to point communication and
a centralized (rather than a distributed) environment.

A recent work similar to ours is TRAPP (see [7]). The similarity is in the
objective of quantifying the tradeoff between consistency and performance.
However, the main differences are in the basic assumptions. First, the TRAPP
system deals with numeric data in traditional relational databases. Second, it
quantifies the tradeoff for aggregation queries. Third, and probably most
fundamentally, it deals with the problem of answering a particular
instantaneous query, whereas we deal with database consistency. Specifically, we
want the consistency of the whole database to be maximized for as long as
possible. In other words, we maximize consistency in response to continuous queries
that retrieve the whole database.

Abstract. This paper presents a general methodology for the efficient
parallelization of existing data cube construction algorithms. We describe
two different partitioning strategies, one for top-down and one for
bottom-up cube algorithms. Both partitioning strategies assign subcubes
to individual processors in such a way that the loads assigned to the
processors are balanced. Our methods reduce inter-processor communication
overhead by partitioning the load in advance instead of computing
each individual group-by in parallel as is done in previous parallel
approaches. In fact, after the initial load distribution phase, each processor
can compute its assigned subcube without any communication with the
other processors. Our methods enable code reuse by permitting the use
of existing sequential (external memory) data cube algorithms for the
subcube computations on each processor. This supports the transfer of
optimized sequential data cube code to a parallel setting.

The bottom-up partitioning strategy balances the number of single
attribute external memory sorts made by each processor. The top-down
strategy partitions a weighted tree in which weights reflect algorithm
specific cost measures like estimated group-by sizes. Both partitioning
approaches can be implemented on any shared disk type parallel
machine composed of p processors connected via an interconnection fabric
and with access to a shared parallel disk array. Experimental results
presented show that our partitioning strategies generate a close to optimal
load balance between processors.



1 Introduction 

Data cube queries represent an important class of On-Line Analytical
Processing (OLAP) queries in decision support systems. The precomputation of the
different group-bys of a data cube (i.e., the forming of aggregates for every
combination of GROUP BY attributes) is critical to improving the response time
of the queries [16]. Numerous solutions for generating the data cube have been
proposed. One of the main differences between the many solutions is whether
they are aimed at sparse or dense relations [4,17,20,21,27]. Solutions within a
category can also differ considerably. For example, top-down data cube
computations for dense relations based on sorting have different characteristics from
those based on hashing.

To meet the need for improved performance and to effectively handle the
increase in data sizes, parallel solutions for generating the data cube are needed.
In this paper we present a general framework for the efficient parallelization
of existing data cube construction algorithms. We present load balanced and
communication efficient partitioning strategies which generate a subcube
computation for every processor. Subcube computations are then carried out using
existing sequential, external memory data cube algorithms.

Balancing the load assigned to different processors and minimizing the
communication overhead are the core problems in achieving high performance on
parallel systems. At the heart of this paper are two partitioning strategies, one for
top-down and one for bottom-up data cube construction algorithms. Good load
balancing approaches generally make use of application specific characteristics.
Our partitioning strategies assign loads to processors by using metrics known to
be crucial to the performance of data cube algorithms [1,4,21]. The bottom-up
partitioning strategy balances the number of single attribute external sorts made
by each processor [4]. The top-down strategy partitions a weighted tree in which
weights reflect algorithm specific cost measures such as estimated group-by sizes
[1,21].

The advantages of our load balancing methods compared to the previously 
published parallel data cube construction methods [13,14] are: 

- Our methods reduce inter-processor communication overhead by
  partitioning the load in advance instead of computing each individual group-by in
  parallel (as proposed in [13,14]). In fact, after our load distribution phase,
  each processor can compute its assigned subcube without any inter-processor
  communication.

- Our methods maximize code reuse from existing sequential data cube
  implementations by using existing sequential (external memory) data cube
  algorithms for the subcube computations on each processor. This supports the
  transfer of optimized sequential data cube code to the parallel setting.

Our partitioning approaches are designed for standard, shared disk type,
parallel machines: p processors connected via an interconnection fabric where the
processors have standard-size local memories and access to a shared disk array.
We have implemented our top-down partitioning strategy in MPI and tested it
on a multiprocessor cluster. We also tested our bottom-up partitioning strategy
through a simulation. Our experimental results indicate that our partitioning
strategies generate close to optimal load balancing. Our tests on the
multiprocessor cluster showed close to optimal (linear) speedup.

The paper is organized as follows. Section 2 describes the parallel machine
model underlying our partitioning approaches as well as the input and the output
configuration for our algorithms. Section 3 presents our partitioning approach
for parallel bottom-up data cube generation and Section 4 outlines our method
for parallel top-down data cube generation. In Section 5 we indicate how our
top-down cube parallelization can be easily modified to obtain an efficient
parallelization of the ArrayCube method [27]. Section 6 presents the performance
analysis of our partitioning approaches. Section 7 concludes the paper and
discusses possible extensions of our methods.

2 Parallel Computing Model 

We use the standard shared disk parallel machine model. That is, we assume p
processors connected via an interconnection fabric where processors have
standard size local memories and concurrent access to a shared disk array. For the
purpose of parallel algorithm design, we use the Coarse Grained Multicomputer
(CGM) model [5,8,15,18,23]. More precisely, we use the EM-CGM model [6,7,9]
which is a multi-processor version of Vitter's Parallel Disk Model [24,25,26].

For our parallel data cube construction methods we assume that the
d-dimensional input data set R of size N is stored on the shared disk array. The
output, i.e. the group-bys comprising the data cube, will be written to the
shared disk array. For the choice of output file format, it is important to consider
the way in which the data cube will be used in subsequent applications. For
example, if we assume that a visualization application will require fast access to
individual group-bys then we may want to store each group-by in striped format
over the entire disk array.



3 Parallel Bottom-Up Data Cube Construction 

In many data cube applications, the underlying data set R is sparse; i.e., N
is much smaller than the number of possible values in the given d-dimensional
space. Bottom-up data cube construction methods aim at computing the data
cube for such cases. Bottom-up methods like BUC [4] and PartitionCube (part of
[20]) calculate the group-bys in an order which emphasizes the reuse of previously
computed sort orders and in-memory sorts through data locality. If the data has
previously been sorted by attribute A, then creating an AB sort order does not
require a complete resorting. A local resorting of A-blocks (blocks of consecutive
elements that have the same attribute A) can be used instead. The sorting of
such A-blocks can often be performed in local memory and, hence, instead of
another external memory sort, the AB order can be created in one single scan
through the disk. Bottom-up methods [4,20] attempt to break the problem into
a sequence of single attribute sorts which share prefixes of attributes and can
be performed in local memory with a single disk scan. As outlined in [4,20], the
total computation time of these methods is dominated by the number of such
single attribute sorts.
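
The local resorting of A-blocks can be sketched as follows. This is a minimal
in-memory illustration, not the external-memory procedure of [4,20]; the
function name refine_sort and the sample rows are hypothetical. Rows are
tuples, the input is assumed to be sorted on its first prefix_len attributes,
and each block of equal prefix is sorted on one further attribute during a
single pass.

```python
from itertools import groupby
from operator import itemgetter

def refine_sort(rows, prefix_len, key_index):
    """Sort each block of rows sharing the same prefix on one more attribute.

    rows must already be sorted on their first prefix_len attributes.
    """
    out = []
    for _, block in groupby(rows, key=lambda r: r[:prefix_len]):
        # Each block is small enough to sort locally in this sketch.
        out.extend(sorted(block, key=itemgetter(key_index)))
    return out

# Rows are (A, B, measure), already sorted by A.
rows = [("a1", "b2", 5), ("a1", "b1", 3), ("a2", "b3", 1), ("a2", "b1", 7)]
ab_order = refine_sort(rows, prefix_len=1, key_index=1)
print(ab_order)
```

The result is the same order a full sort on (A, B) would give, but each block
is handled independently, which is what allows the single scan described above.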

In this section we describe a partitioning of the group-by computations into p
independent subproblems. Our goal is to balance the number of single attribute
sorts required to solve each subproblem and to ensure that each subproblem
has overlapping sort sequences in the same way as for the sequential methods
(thereby avoiding additional work).

Let A_1, ..., A_d be the attributes of the data cube such that |A_1| ≥ |A_2| ≥ ...
≥ |A_d|, where |A_i| is the number of different possible values for attribute A_i. As
observed in [20], the set of all group-bys of the data cube can be partitioned
into those that contain A_1 and those that do not contain A_1. In our
partitioning approach, the group-bys containing A_1 will be sorted by A_1. We indicate
this by saying that they contain A_1 as a prefix. The group-bys not containing
A_1 (i.e., A_1 is projected out) contain A_1 as a postfix. We then recurse with the
same scheme on the remaining attributes. We shall utilize this property to
partition the computation of all group-bys into independent subproblems computing
group-bys. The load between subproblems will be balanced and they will have
overlapping sort sequences in the same way as for the sequential methods. In the
following we give the details of our partitioning method.

Let x, y, z be sequences of attributes representing sort orders and let A be
an arbitrary single attribute. We introduce the following definition of sets of
attribute sequences representing sort orders (and their respective group-bys):

B_1(x, A, z) = {x, xA}    (1)

B_i(x, Ay, z) = B_{i-1}(xA, y, z) ∪ B_{i-1}(x, y, Az),  2 ≤ i ≤ log p + 1    (2)

The entire data cube construction corresponds to the set B_d(∅, A_1 ... A_d, ∅) of
sort orders and respective group-bys, where d is the dimension of the data
cube. We refer to i as the rank of B_i(...). The set B_d(∅, A_1 ... A_d, ∅) is the union
of two subsets of rank d - 1: B_{d-1}(A_1, A_2 ... A_d, ∅) and B_{d-1}(∅, A_2 ... A_d, A_1).
These, in turn, are the union of four subsets of rank d - 2. A complete example
for a 4-dimensional data cube with attributes A, B, C, D is shown in Figure 1.



B_4(∅, ABCD, ∅)
  B_3(∅, BCD, A)
    B_2(∅, CD, BA)
      B_1(∅, D, CBA) = {∅, D}
      B_1(C, D, BA) = {C, CD}
    B_2(B, CD, A)
      B_1(B, D, CA) = {B, BD}
      B_1(BC, D, A) = {BC, BCD}
  B_3(A, BCD, ∅)
    B_2(A, CD, B)
      B_1(A, D, CB) = {A, AD}
      B_1(AC, D, B) = {AC, ACD}
    B_2(AB, CD, ∅)
      B_1(AB, D, C) = {AB, ABD}
      B_1(ABC, D, ∅) = {ABC, ABCD}

Fig. 1. Partitioning For A 4-Dimensional Data Cube With Attributes A, B, C, D.
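
The recursion of Equations (1) and (2) is short enough to run directly. The
sketch below is an illustration only: attributes are single characters, the
empty sequence is written as an empty string, and the helper name expand_b is
hypothetical. It relies on the observation that the rank of a B-set equals the
number of attributes in its middle argument, and it lists the prefix branch
(xA) before the postfix branch, so sibling order may differ from the figure.

```python
def expand_b(x, y, z, indent=0, out=None):
    """List the B-sets generated from B_rank(x, y, z) as indented lines.

    x, y, z are strings of single-character attributes; rank = len(y).
    """
    out = [] if out is None else out
    show = lambda s: s or "∅"
    label = f"B{len(y)}({show(x)}, {y}, {show(z)})"
    a, rest = y[0], y[1:]
    if not rest:
        # Equation (1): B_1(x, A, z) = {x, xA}
        out.append(" " * indent + f"{label} = {{{show(x)}, {x + a}}}")
    else:
        # Equation (2): split into the prefix branch and the postfix branch.
        out.append(" " * indent + label)
        expand_b(x + a, rest, z, indent + 2, out)
        expand_b(x, rest, a + z, indent + 2, out)
    return out

lines = expand_b("", "ABCD", "")
print("\n".join(lines))
```

For four attributes this prints 15 sets (1 + 2 + 4 + 8), and the eight leaf
sets together contain each of the 16 group-bys exactly once.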



For the sake of simplifying the discussion, we assume that p is a power of
2. Consider the 2p B-sets of rank d - log_2(p) - 1. Let β = (B^1, B^2, ..., B^{2p}) be
these 2p sets in the order defined by Equation (2). Define

Shuffle(β) = ⟨B^1 ∪ B^{2p}, B^2 ∪ B^{2p-1}, B^3 ∪ B^{2p-2}, ..., B^p ∪ B^{p+1}⟩
           = ⟨Γ_1, ..., Γ_p⟩

We assign set Γ_i = B^i ∪ B^{2p-i+1} to processor P_i, 1 ≤ i ≤ p. Observe that
from the construction of all group-bys in each Γ_i it follows that every processor
performs the same number of single attribute sorts.

Algorithm 1 Parallel Bottom-Up Cube Construction.

Each processor P_i, 1 ≤ i ≤ p, performs the following steps, independently
and in parallel:

(1) Calculate Γ_i as described above.

(2) Compute all group-bys in Γ_i using a sequential (external-memory)
bottom-up cube construction method.

— End of Algorithm — 
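
Step (1) of Algorithm 1 can be sketched as below. This is an illustration
under stated assumptions rather than the authors' implementation: attributes
are single characters, p is a power of 2, the B-sets are taken in the order in
which Equation (2) lists its two terms, and the sketch requires the remaining
rank to be at least 1. The function names are hypothetical.

```python
def expand(x, y, z, depth):
    """Apply Equation (2) depth times; return (x, y, z) triples in order."""
    if depth == 0:
        return [(x, y, z)]
    a, rest = y[0], y[1:]
    return (expand(x + a, rest, z, depth - 1)
            + expand(x, rest, a + z, depth - 1))

def group_bys(x, y):
    """All group-bys of the B-set with prefix x and remaining attributes y."""
    a, rest = y[0], y[1:]
    if not rest:
        return [x, x + a]                     # Equation (1)
    return group_bys(x + a, rest) + group_bys(x, rest)

def shuffle(attrs, p):
    """Return p lists of group-bys, pairing the i-th B-set with the
    (2p - i + 1)-th, counting from 1."""
    assert p >= 1 and (p & (p - 1)) == 0      # p must be a power of 2
    depth = p.bit_length()                    # equals log2(p) + 1
    assert len(attrs) - depth >= 1            # assumption: remaining rank >= 1
    b = [group_bys(x, y) for x, y, _ in expand("", attrs, "", depth)]
    return [b[i] + b[2 * p - 1 - i] for i in range(p)]

gammas = shuffle("ABCD", 2)
for i, gamma in enumerate(gammas, start=1):
    print(f"P{i}:", [g or "∅" for g in gamma])
```

With four attributes and two processors, each processor receives eight of the
sixteen group-bys and no group-by is assigned twice.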

Algorithm 1 can easily be generalized to values of p which are not powers 
of 2. We also note that Algorithm 1 requires p < This is usually the case 
in practice. However, if a parallel algorithm is needed for larger values of p, 
the partitioning strategy needs to be augmented. Such an augmentation could, 
for example, be a partitioning strategy based on the number of data items for 
a particular attribute. This would be applied after partitioning based on the 
number of attributes has been done. Since the range p G {2° . . covers 

current needs with respect to machine and dimension sizes, we do not further 
discuss such augmentations in this paper. 

Algorithm 1 exhibits the following properties: 

(a) The computation of each group-by is assigned to a unique processor. 

(b) The calculation of the group-bys assigned to processor P_i requires the
same number of single attribute sorts for all 1 ≤ i ≤ p.

(c) The sorts performed at processor Pi share prefixes of attributes in the same 
way as in [4,20] and can be performed with disk scans in the same manner 
as in [4,20]. 

(d) The algorithm requires no inter-processor communication. 

These four properties are the basis of our argument that our partitioning
approach is load balanced and communication efficient. In Section 6, we will also
present an experimental analysis of the performance of our method.



4 Parallel Top-Down Data Cube Construction 

Top-down approaches for computing the data cube, like the sequential PipeSort,
PipeHash, and Overlap methods [1,10,21], use more detailed group-bys to
compute less detailed ones that contain a subset of the attributes of the former.
They apply to data sets where the number of data items in a group-by can
shrink considerably as the number of attributes decreases (data reduction). A
group-by is called a child of some parent group-by if the child can be computed
from the parent by aggregating some of its attributes. This induces a partial
ordering of the group-bys, called the lattice. An example of a 4-dimensional lattice
is shown in Figure 2, where A, B, C, and D are the four different attributes.
The PipeSort, PipeHash, and Overlap methods select a spanning tree T of the
lattice, rooted at the group-by containing all attributes. PipeSort considers two
cases of parent-child relationships. If the ordered attributes of the child are a
prefix of the ordered attributes of the parent (e.g., ABCD → ABC) then a simple
scan is sufficient to create the child from the parent. Otherwise, a sort is
required to create the child. PipeSort seeks to minimize the total computation cost
by computing minimum cost matchings between successive layers of the lattice.
PipeHash uses hash tables instead of sorting. Overlap attempts to reduce sort
time by utilizing the fact that overlapping sort orders do not always require a
complete new sort. For example, the ABC group-by has A partitions that can
be sorted independently on C to produce the AC sort order. This may allow
these independent sorts to be performed in memory rather than with an external
memory sort.
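
The two parent-child cases that PipeSort distinguishes can be illustrated with
a small sketch. It is not taken from [1]: the function names are hypothetical,
sort orders are written as strings of single-character attributes, rows are
(attribute tuple, measure) pairs, and SUM is assumed as the aggregate.

```python
from itertools import groupby

def needs_sort(parent_order, child_order):
    """A child can be produced by a scan only if it is a prefix of the parent."""
    return not parent_order.startswith(child_order)

def child_by_scan(parent_rows, k):
    """One pass over a parent sorted on its attributes, keeping the first k.

    parent_rows: list of (attribute_tuple, measure); SUM is assumed here.
    """
    return [(key, sum(measure for _, measure in rows))
            for key, rows in groupby(parent_rows, key=lambda r: r[0][:k])]

print(needs_sort("ABCD", "ABC"))   # False: ABC is a prefix of ABCD
print(needs_sort("ABCD", "ABD"))   # True: ABD requires a sort

# An ABC group-by sorted on (A, B, C), aggregated to AB in a single scan.
abc = [(("a1", "b1", "c1"), 2), (("a1", "b1", "c2"), 3),
       (("a1", "b2", "c1"), 4), (("a2", "b1", "c1"), 1)]
ab = child_by_scan(abc, 2)
print(ab)
```

Because the parent is already sorted on a prefix that contains the child's
attributes, equal keys are adjacent and a single pass suffices.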



ABCD

ABC ABD ACD BCD

AB AC AD BC BD CD

Fig. 2. A 4-Dimensional Lattice.



Next, we outline a partitioning approach which generates p independent
subproblems, each of which can be solved by one processor using an existing
external-memory top-down cube algorithm. The first step of our algorithm
determines a spanning tree T of the lattice by using one of the existing approaches
such as PipeSort, PipeHash, or Overlap. To balance the load between
the different processors we next perform a storage estimation to determine
approximate sizes of the group-bys in T. This can be done, for example, by using
methods described in [11] and [22]. We now work with a weighted tree. The
most crucial part of our solution is the partitioning of the tree. The
partitioning of T into subtrees induces a partitioning of the data cube problem into p
subproblems (subsets of group-bys). Determining an optimal partitioning of the
weighted tree is easily shown to be an NP-complete problem (for example, by
a reduction from p-processor scheduling). Since the weights of the tree
represent estimates, a heuristic approach which generates p subproblems with
“some control” over the sizes of the subproblems holds the most promise. While
we want the sizes of the p subproblems balanced, we also want to minimize the
number of subtrees assigned to a processor. Every subtree may require a
scanning of the entire data set R and thus too many subtrees can result in poor I/O
performance. The solution we develop balances these two considerations.

Our heuristic makes use of a related partitioning problem on trees for which
efficient algorithms exist, the min-max tree k-partitioning problem [3].

Definition 1. Min-max tree k-partitioning: Given a tree T with n vertices and 
a positive weight assigned to each vertex, delete k edges in the tree such that the 
largest total weight of a resulting subtree is minimized. 

The min-max tree k-partitioning problem has been studied in [3,12,19], and
an O(n) time algorithm has been presented in [12]. A min-max k-partitioning
does not necessarily compute a partitioning of T into subtrees of equal size and
it does not address tradeoffs arising from the number of subtrees assigned to a
processor. We use tree-partitioning as a preprocessing step for our partitioning.
To achieve a better distribution of the load we apply an over partitioning
strategy: instead of partitioning the tree T into p subtrees, we partition it into s × p
subtrees, where s is an integer, s ≥ 1. Then, we use a “packing heuristic” to
determine which subtrees belong to which processors, assigning s subtrees to every
processor. Our packing heuristic considers the weights of the subtrees and pairs
subtrees by weights to control the number of subtrees. It consists of s - 1 matching
phases in which the p largest subtrees (or groups of subtrees) and the p smallest
subtrees (or groups of subtrees) are matched up. Details are described in Step
2b of Algorithm 2.

Algorithm 2 Sequential Tree-partition(T, s, p).

Input: A spanning tree T of the lattice with positive weights assigned to the
nodes (representing the cost to build each node from its ancestor in T). Integer
parameters s (oversampling ratio) and p (number of processors).

Output: A partitioning of T into p subsets S_1, ..., S_p of s subtrees each.

(1) Compute a min-max tree s × p-partitioning of T into s × p subtrees
T_1, ..., T_{s×p}.

(2) Distribute subtrees T_1, ..., T_{s×p} among the p subsets S_1, ..., S_p, s subtrees
per subset, as follows:

(2a) Create s × p sets of trees named 𝒯_i, 1 ≤ i ≤ sp, where initially
𝒯_i = {T_i}. The weight of 𝒯_i is defined as the total weight of the trees
in 𝒯_i.

(2b) For j = 1 to s - 1

• Sort the 𝒯-sets by weight, in increasing order. W.l.o.g., let 𝒯_1,
..., 𝒯_{sp-(j-1)p} be the resulting sequence.

• Set 𝒯_i := 𝒯_i ∪ 𝒯_{sp-(j-1)p-i+1}, 1 ≤ i ≤ p.

• Remove 𝒯_{sp-(j-1)p-i+1}, 1 ≤ i ≤ p.

(2c) Set S_i = 𝒯_i, 1 ≤ i ≤ p.

— End of Algorithm — 
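
Step (2) of Algorithm 2 can be sketched as follows. The sketch assumes that
the min-max partitioning of Step (1) has already produced s × p subtrees and
that only their names and weights are needed; the function name pack_subtrees
and the sample weights are hypothetical.

```python
def pack_subtrees(weights, s, p):
    """Group s*p weighted subtrees into p groups (Steps 2a to 2c).

    weights: dict mapping a subtree name to its positive weight.
    Returns a list of (names, total_weight) pairs, one per processor.
    """
    assert len(weights) == s * p
    # Step 2a: every subtree starts in its own group.
    groups = [([name], w) for name, w in weights.items()]
    # Step 2b: s - 1 matching phases.
    for _ in range(s - 1):
        groups.sort(key=lambda g: g[1])
        n = len(groups)
        # Match the i-th lightest group with the i-th heaviest one.
        merged = [(groups[i][0] + groups[n - 1 - i][0],
                   groups[i][1] + groups[n - 1 - i][1]) for i in range(p)]
        # The p heaviest groups are absorbed; the middle ones are kept.
        groups = merged + groups[p:n - p]
    # Step 2c: the remaining p groups are the processor assignments.
    return groups

weights = {"T1": 9, "T2": 7, "T3": 6, "T4": 4, "T5": 3, "T6": 1}
assignment = pack_subtrees(weights, s=3, p=2)
for names, load in assignment:
    print(names, load)
```

In this example the six subtrees, with total weight 30, end up in two groups
of three subtrees each, with loads 14 and 16.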

The above tree partition algorithm is embedded into our parallel top-down
data cube construction algorithm. Our method provides a framework for
parallelizing any sequential top-down data cube algorithm. An outline of our approach
is given in the following Algorithm 3.

Algorithm 3 Parallel Top-Down Cube Construction.

Each processor P_i, 1 ≤ i ≤ p, performs the following steps independently
and in parallel:

(1) Select a sequential top-down cube construction method (e.g.,
PipeSort, PipeHash, or Overlap) and compute the spanning tree T of the
lattice as used by this method.

(2) Apply the storage estimation method in [22] and [11] to determine the
approximate sizes of all group-bys in T. Compute the weight of each
node of T; i.e., the cost to build each node from its ancestor in T.

(3) Execute Algorithm Tree-partition(T, s, p) as shown above, creating
p sets S_1, ..., S_p. Each set S_i contains s subtrees of T.

(4) Compute all group-bys in subset S_i using the sequential top-down
cube construction method chosen in Step 1.

— End of Algorithm — 

Our performance results described in Section 6 show that an over partitioning 
with s = 2 or 3 achieves very good results with respect to balancing the loads 
assigned to the processors. This is an important result since a small value of s 
is crucial for optimizing performance. 



5 Parallel Array-Based Data Cube Construction 

Our method in Section 4 can be easily modified to obtain an efficient
parallelization of the ArrayCube method presented in [27]. The ArrayCube method is
aimed at dense data cubes and structures the raw data set in a d-dimensional
array stored on disk as a sequence of chunks. Chunking is a way to divide the
d-dimensional array into small size d-dimensional chunks where each chunk is a
portion containing a data set that fits into a disk block. When a fixed sequence
of such chunks is stored on disk, the calculation of each group-by requires a
certain amount of buffer space [27]. The ArrayCube method calculates a minimum
memory spanning tree of group-bys, MMST, which is a spanning tree of the
lattice such that the total amount of buffer space required is minimized. The
total number of disk scans required for the computation of all group-bys is the
total amount of buffer space required divided by the memory space available.
The ArrayCube method can now be parallelized by simply applying Algorithm 3
with T being the MMST. More details will be given in the full version of this
paper.

6 Experimental Performance Analysis 

We have implemented and tested our parallel top-down data cube construction
method presented in Section 4. We implemented sequential pipesort [1] in C++,
and our parallel top-down data cube construction method (Section 4) in C++
with MPI [2]. As parallel hardware platform, we use a 9-node cluster. One node
is used as the root node, to partition the lattice and distribute the work among
the other 8 machines which we refer to as compute nodes. The root is an IBM
Netfinity server with two 9 GB SCSI disks, 512 MB of RAM and a 550 MHz Pentium
processor. The compute nodes are 133 MHz Pentium processors, with 2 GB IDE
hard drives and 32 MB of RAM. The processors run Linux and are connected
via a 100 Mbit Fast Ethernet switch with full wire speed on all ports.




Fig. 3. Running Time (in seconds) As A Function Of The Number of Compute Nodes 
(Processors) 



Figure 3 shows the running time observed (in seconds) as a function of the
number of compute nodes used. For the same data set, we measured the
sequential time (sequential pipesort [1]) and the parallel time obtained through our
parallel top-down data cube construction method (Section 4), using an
oversampling ratio of s = 2. The data set consisted of 100,000 records with dimension
6. The attribute cardinalities for dimensions 1 to 6 were 5, 15, 500, 20, 1000,
and 2, respectively. Our test data values were sparse and uniformly distributed.
There are three curves shown. Max-time is the
time taken by the slowest compute node (i.e. the node that received the largest
workload). Avg-time is the average time taken by the compute nodes. The time
taken by the root node, to partition the lattice and distribute the work among
the compute nodes, was insignificant. The optimal time shown in Figure 3 is the
sequential pipesort time divided by the number of compute nodes (processors)
used.

We observe that the max-time and optimal curves are essentially identical. 
That is, for an oversampling ratio of s = 2, the speedup observed is very close 
to optimal. 

Note that the difference between max-time and avg-time represents the load
imbalance created by our partitioning method. As expected, the difference grows
with an increasing number of processors. However, we observed that a good part






F. Dehne et al. 



of this growth can be attributed to the estimation of the cube sizes used in
the tree partitioning. We are currently experimenting with improved estimators
which appear to improve the result. Interestingly, the avg-time curve is below
the optimal curve, while the max-time and optimal curves are essentially
identical. One would have expected that the optimal and avg-time curves are similar
and that the max-time curve is slightly above. We believe that this is caused by
another effect which benefits our parallel method: improved I/O. When
sequential pipesort is applied to a 10-dimensional data set, the lattice is partitioned
into pipes of length up to 10. In order to process a pipe of length 10, pipesort
needs to write to 10 open files at the same time. It appears that the number of
open files can have a considerable impact on performance. For 100,000 records,
writing them to 4 files took 8 seconds on our system. Writing them to 6 files
took 23 seconds, not 12, and writing them to 8 files took 48 seconds, not 16. This
benefits our parallel method, since we partition the lattice first and then apply
pipesort to each part. Therefore, the pipes generated in the parallel method are
considerably shorter.
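
The figures 12 and 16 quoted above are what a linear extrapolation from the
4-file measurement would predict. The short sketch below reproduces that
arithmetic from the three reported timings; the function name linear_estimate
is hypothetical, and no timings other than those stated in the text are used.

```python
# Open files -> seconds to write 100,000 records, as reported in the text.
measured = {4: 8, 6: 23, 8: 48}

def linear_estimate(files, base_files=4):
    """Time expected if cost grew in proportion to the number of open files."""
    return measured[base_files] * files / base_files

for files, seconds in measured.items():
    expected = linear_estimate(files)
    print(f"{files} files: measured {seconds}s, linear {expected:.0f}s, "
          f"ratio {seconds / expected:.2f}")
```

The measured time is roughly twice the linear estimate for 6 files and three
times the estimate for 8 files, which is why shorter pipes help.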




Fig. 4. Running Time (in seconds) As A Function Of The Size Of The Data Set 
(number of rows / 1000) 



Figure 4 shows the running times (in seconds) of our top-down data cube
parallelization as we increase the data size from 100,000 to 1,000,000 rows. Note
that the scale is logarithmic. The main observation is that the parallel running
time (max-time) increases essentially linearly with respect to the data size.

Figure 5 shows the running times as a function of the oversampling ratio s. 
We observe that the parallel running time (i.e., max — time) is best for s = 3. 
This is due to the following tradeoff. Clearly, the workload balance improves as 
s increases. However, as the total number of subtrees, s x p, generated in the 
tree partitioning algorithm increases, we need to perform more sorts for the root 
nodes of these subtrees. The optimal tradeoff point for our test case is s = 3. 

Fig. 5. Running Time (in seconds) As A Function Of The Oversampling Ratio (s)



Figure 6 shows the running times (in seconds) of our top-down data cube
parallelization as we increase the dimension of the data set from 2 to 10. Note
that the number of group-bys to be computed grows exponentially with respect
to the dimension of the data set. In Figure 6, we observe that the parallel running
time grows essentially linearly with respect to the output. We also executed our
parallel algorithm for a 15-dimensional data set of 10,000 rows, and the resulting
data cube was of size more than 1G.





Fig. 6. Running Time (in seconds) As A Function Of The Number Of Dimensions Of 
The Data Set. 



Simulation results for our bottom-up data cube parallelization in Section 3 are 
shown in Figure 7. For this method we have so far measured its load balancing 
characteristics through simulation only. As indicated in [4,20], the main indicator 
for the load generated by a bottom-up data cube computation is the number 
of single attribute sorts. Our partitioning method in Section 3 for bottom-up
data cube parallelization does in fact guarantee that the subcube computations
assigned to the individual processors all require exactly the same number
of single attribute sorts. There are no heuristics (like oversampling) involved. 
Therefore, what we have measured in our simulation is whether the output sizes 
of the subcube computations assigned to the processors are balanced as well. The 
results are shown in Figure 7. The x-axis represents the number of processors 
p ∈ {2, ..., 64} and the y-axis represents the largest output size as a percentage
of the total data cube size. The two curves shown are the largest output size 
measured for a processor and the optimal value (total data cube size / number of 
processors). Five experiments were used to generate each data point. We observe 
that the actual values are very close to the optimal values. The main result is 
that our partitioning method in Section 3 not only balances the number of single 
attribute sorts but also the sizes of the subcubes generated on each processor. 





Fig. 7. Bottom-Up Cube. Maximum Output Size For One Processor As Percentage Of 
Total Data Cube Size. 



7 Conclusion 

We presented two different partitioning-based data cube parallelizations for
standard shared-disk parallel machines. Our partitioning strategies for
bottom-up and top-down data cube parallelization balance the loads assigned to 
the individual processors, where the loads are measured as defined by the ori- 
ginal proponents of the respective sequential methods. Subcube computations 
are carried out using existing sequential data cube algorithms. Our top-down 
partitioning strategy can also be easily extended to parallelize the ArrayCube 
method. Experimental results indicate that our partitioning methods produce 
well balanced data cube parallelizations. Compared to existing parallel data cube
methods, our parallelization approach brings a significant reduction in inter- 
processor communication and has the important practical benefit of enabling 
the re-use of existing sequential data cube code. 

A possible extension of our data cube parallelization methods is to consider 
a shared nothing parallel machine model. If it is possible to store a duplicate of 
the input data set R on each processor’s disk, then our method can be easily 
adapted for such an architecture. This is clearly not always possible, but it does
cover most of those cases where the total output size is considerably larger than the
input data set, for example sparse data cube computations. In fact, we applied
this strategy for our implementation presented in Section 6. As reported in [20], 
the data cube can be several hundred times as large as R. Sufficient total disk 
space is necessary to store the output (as one single copy distributed over the 
different disks) and a p times duplication of R may be smaller than the output. 
Our data cube parallelization method would then partition the problem in the
same way as described in Sections 3 and 4, and subcube computations would be 
assigned to processors in the same way as well. When computing its subcube, 
each processor would read R from its local disk. For the output, there are two 
alternatives. Since the output data sizes are well balanced, each processor could 
simply write the subcubes generated to its local disk. This could, however, create 
a bottleneck if there is, for example, a visualization application following the data 
cube construction which needs to read a single group-by. In such a case, each 
group-by should be distributed over all disks, for example in striped format. To
obtain such a data distribution, all processors would not write their subcubes 
directly to their local disks but buffer their output. Whenever the buffers are 
full, they would be permuted over the network. In summary we observe that, 
while our approach is aimed at shared disk parallel machines, its applicability 
to shared nothing parallel machines depends mainly on the distribution and 
availability of the input data set R. We are currently considering the problem 
of identifying the “ideal” distribution of input R among the p processors when 
a fixed amount of replication of the input data is allowed (i.e., R can be copied 
r times, 1 < r < p). 

Abstract. Declustering techniques have been widely adopted in parallel
storage systems (e.g. disk arrays) to speed up bulk retrieval of multidi- 
mensional data. A declustering scheme distributes data items among 
multiple devices, thus enabling parallel I/O access and reducing query 
response time. We measure the performance of any declustering scheme 
as its worst case additive deviation from the ideal scheme. The goal thus 
is to design declustering schemes with as small an additive error as possible.
We describe a number of declustering schemes with additive error
O(log M) for 2-dimensional range queries, where M is the number of
disks. These are the first results giving such a strong bound for any value
of M. Our second result is a lower bound on the additive error. In
1997, Abdel-Ghaffar and Abbadi showed that except for a few stringent
cases, additive error of any 2-dim declustering scheme is at least one.
We strengthen this lower bound to $\Omega((\log M)^{(d-1)/2})$ for d-dim schemes
and to $\Omega(\log M)$ for 2-dim schemes, thus proving that the 2-dim schemes
described in this paper are (asymptotically) optimal. These results
are obtained by establishing a connection to geometric discrepancy, a 
widely studied area of mathematics. We also present simulation results 
to evaluate the performance of these schemes in practice. 



1 Introduction 

The past decade has brought dramatic improvement in computer processor speed 
and storage capacity. In contrast, improvement in disk access time has been 
relatively flat. As a result, disk I/O is bound to be the bottleneck for many 
modern data-intensive applications. To cope with the I/O bottleneck, multi-disk 
systems, coupled with a declustering scheme, are usually used. The idea is to 
distribute data blocks across multiple disk devices, so they can be retrieved in 
parallel (i.e., in parallel disk seek operations). Meanwhile, emerging technologies 
in storage area networks (e.g. Fibre Channel-Arbitrated Loop, switch-based I/O
bus, and Gigabit Ethernet) have also enabled one to build a massively parallel 
storage system that contains hundreds or even thousands of disks [1,17]. As 
the number of disks increases, the efficacy of the adopted declustering scheme
becomes even more crucial.

Many applications that adopt declustering schemes have to deal with mult- 
idimensional data. These applications include, for example, remote-sensing da- 
tabases [8,9], parallel search trees [5], and multidimensional databases [12]. In 
this paper, we concentrate on multi-dimensional data that is organized as a uni- 
form grid. A good example is remote-sensing (satellite) data in raster format, 
which may contain dimensions such as latitude, longitude, time, and spectrum. 
An important class of queries against multidimensional data is the range query.
A range query requests a hyper-rectangular subset of the multidimensional data 
space. The response time of the query is measured by the access time of the 
disk that has the maximum number of data blocks to retrieve, and our goal is 
to design declustering schemes that minimize query response time. 

Declustering schemes for range queries are proposed in [6,7,3,15,19,22,23, 
21,12,18,11]. We measure the performance of any declustering scheme as its 
worst-case additive deviation from the ideal scheme. Based on this notion, we 
describe a number of 2-dim schemes with (asymptotically) optimal performance. 
This is done by giving an upper bound on the performance of each of these 
schemes as well as a lower bound on the performance of any declustering scheme. 
These are the first schemes with provably optimal behavior. Our results are 
obtained by establishing a connection to geometric discrepancy, a widely studied 
area of Combinatorics. We have been able to borrow some deep results and 
machinery from discrepancy theory to prove our results on declustering. 

The rest of the paper is organized as follows. First, we formally define the 
declustering problem. Then, in Section 2, we summarize related work and pre- 
sent a summary of our contributions. In Section 3, we state the intuition behind 
our results. We briefly describe the relevant results in discrepancy theory and 
how we use them to prove results on declustering schemes. This is followed in 
Section 4 by a description of a general technique for constructing good declu- 
stering schemes from good discrepancy placements. In Section 5, we describe a 
number of declustering schemes, all with provably (asymptotically) optimal per- 
formance. In Section 6, we present a lower bound argument on the performance 
of any declustering scheme. Finally, in Section 7, we present brute-force simula- 
tion results on 2-dim schemes to show their exact (not asymptotic) performance. 
The results show that in practice, all the schemes have very good performance: 
their worst case deviation from the ideal scheme is within 5 for a large range of 
number of disks (up to 500 disks for some of the cases). 

Notation: Even though our techniques can be generalized to any number of 
dimensions, our theoretical upper bound holds only for the case of two dimen- 
sions. For this reason, in sections 3, 4, and 5, we describe our techniques and the 
resulting declustering schemes for the case of two dimensions. The lower bound 
(presented in Section 6) applies to any number of dimensions.

1.1 Problem Definition

Consider a dataset organized as a multi-dimensional grid of $N_1 \times N_2 \times \cdots \times N_d$
tiles. Let $(x_1, x_2, \ldots, x_d)$ denote a point with coordinate $x_i$ in dimension i.
Given M disks, a declustering scheme, s, assigns tile $(x_1, x_2, \ldots, x_d)$ to the disk
numbered $s(x_1, x_2, \ldots, x_d)$. A range query retrieves a hyper-rectangular set of
tiles contained within the grid. We define the (nominal) response time of query
Q under scheme s, RT(s, Q), to be the maximum number of tiles from the query
that get assigned to the same disk. Formally, let $tile_i(s, Q)$, $i = 0, 1, \ldots, M - 1$,
represent the number of tiles in Q that get assigned to disk i under scheme s.
Then $RT(s, Q) = \max_{0 \le i < M} tile_i(s, Q)$. One may consider the unit of response
time to be the average disk access time (including seek, rotational, and transfer 
time) to retrieve a data block. Thus, the notion of response time indicates the 
expected I/O delay for answering the query. The problem, therefore, is to devise 
a declustering scheme that would minimize the query response time. 

An ideal declustering scheme would achieve, for each query Q, the optimal 
response time $ORT(Q) = \lceil |Q|/M \rceil$, where |Q| is the number of tiles in Q.
The additive error of any declustering scheme s is defined as the maximum 
(over all queries) difference between response time and optimal response time. 
Formally, 



additive error of scheme s = $\max_{Q} \, (RT(s, Q) - ORT(Q))$.

Note the above definition is independent of grid size. That is, query Q could be 
as large as possible, and the additive error is not necessarily finite. 

The additive error is a measure of the performance of a declustering scheme 
and thus our goal is to design schemes with the smallest possible additive error. 
Finally, when proving our theoretical results, we will frequently omit the ceiling 
in the expression of the optimal response time. This will change the additive 
error by at most one. 
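
The definitions above can be checked by brute force on small grids. The following Python sketch is illustrative only: the function names, the inclusive-range encoding of a query, and the use of the rule (x + y) mod M as a sample scheme are assumptions made for this example and are not taken from the paper.

```python
from itertools import product
from math import ceil


def response_time(scheme, query, num_disks):
    """RT(s, Q): the largest number of tiles of Q that land on a single disk."""
    (x_lo, x_hi), (y_lo, y_hi) = query  # inclusive tile ranges (assumed encoding)
    load = [0] * num_disks
    for x in range(x_lo, x_hi + 1):
        for y in range(y_lo, y_hi + 1):
            load[scheme(x, y)] += 1
    return max(load)


def additive_error(scheme, num_disks, grid_size):
    """Largest RT(s, Q) - ORT(Q) over all range queries in a square grid."""
    worst = 0
    n = grid_size
    for x_lo, y_lo in product(range(n), repeat=2):
        for x_hi in range(x_lo, n):
            for y_hi in range(y_lo, n):
                tiles = (x_hi - x_lo + 1) * (y_hi - y_lo + 1)
                optimal = ceil(tiles / num_disks)  # ORT(Q) = ceil(|Q| / M)
                rt = response_time(scheme, ((x_lo, x_hi), (y_lo, y_hi)), num_disks)
                worst = max(worst, rt - optimal)
    return worst


def disk_modulo(x, y, num_disks=4):
    """Sample scheme for the demo (assumed form): tile (x, y) goes to (x + y) mod M."""
    return (x + y) % num_disks


if __name__ == "__main__":
    print(additive_error(disk_modulo, 4, 4))
```

Because the search enumerates every query, it is only practical for small grids; it measures the error restricted to the chosen grid size rather than the grid-independent quantity defined above.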



2 Related Work and Our Contributions 

Declustering has been a very well studied problem and a number of schemes for 
uniform data have been proposed [11,18,12,10,22,9,6,7,8]. However, very few of 
these schemes have a good worst case behavior, (e.g., the 2-dim disk modulo 
scheme [11] can have additive error as large as We are aware of three 

schemes with limited guarantee in 2-dimensions. These include two of our ear- 
lier schemes - GRS scheme [6] and Hierarchical scheme [7] - and a scheme of 
Atallah and Prabhakar [3]. For the GRS scheme, we also give some analytical 
evidence of an excellent average case performance. Even these 2-dim guarantees 
are somewhat weak. For the GRS scheme [6], we could prove a bound only when 
M is a Fibonacci number. Atallah and Prabhakar’s scheme is defined only when 
M is a power of two. The hierarchical scheme [7] is constructed recursively from 
other base schemes and the resulting performance depends on the performance 
of these base schemes. 

In this paper, for the first time, we prove that a number of 2-dim schemes
have additive error O(logM) for all values of M. We also present brute-force 
simulation results to show that the exact (not asymptotic) additive error is 
within 5 for a large range of number of disks (up to 500 disks in some of the 
cases). 

The case of higher (than two) dimensions appears intrinsically very difficult. 
None of the proposed schemes provide any non-trivial theoretical guarantees in 
higher dimensions. We believe that generalization of the techniques described in 
this paper will result in good higher dimensional declustering schemes in practice. 

A related question is what is the smallest possible error of a declustering 
scheme. Abdel-Ghaffar and Abbadi [2] showed that except for a few stringent 
cases, additive error of any 2-dim scheme is at least one. We strengthen this 
lower bound to $\Omega(\log M)$ for 2-dim schemes, thus proving that the 2-dim schemes
described in this paper are (asymptotically) optimal. We have also been able to
generalize our lower bound to $\Omega((\log M)^{(d-1)/2})$ for d-dim schemes.

These results have been proved by relating the declustering problem to the
discrepancy problem, a well-studied subdiscipline of combinatorics. We have
borrowed some deep results and machinery from discrepancy theory research to 
prove our results. 

We present a general technique for constructing good declustering schemes 
from good discrepancy placements. Given that discrepancy theory is an active 
area of research, we feel that this may be our most important technical contribution.
It leaves open the possibility that one may take new and improved discrepancy
placements and translate them into even better declustering schemes. As
evidence of the power and generality of our present technique, a straightforward
corollary of our main theorem implies a significantly better bound for the GRS 
scheme than what we had proved in an earlier paper [6]. 

3 Intuition of Our Results 

All our schemes are motivated by results in discrepancy theory [20]. We give a 
very brief description of the relevant results from discrepancy theory. 

3.1 Discrepancy Theory 

Given any integer M, the goal is to determine positions of M points in a 2- 
dimensional unit square such that these points are placed as uniformly as pos- 
sible. There are several possible ways of measuring uniformity of any placement 
scheme. The definition most relevant to us is the following:

Fix a placement P of M points and consider any rectangle R (whose sides
are parallel to the sides of the unit square). If the points were placed completely
uniformly in the unit square, we would expect R to contain about area(R) * M points,
where area(R) denotes the area of R.

Measure the absolute difference between (number of points falling in R) and
area(R) * M. This defines the “discrepancy” of placement P with respect to
rectangle R. The discrepancy of placement P is defined as the highest value of
discrepancy with respect to any rectangle R. The goal is to design placement
schemes with smallest possible discrepancy.
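
For small point sets this quantity can be evaluated directly. The sketch below is a brute-force illustration under stated assumptions: the function name is hypothetical, points are assumed to lie in the unit square, and the search is restricted to rectangles whose sides lie on point coordinates or on the square's border. Closed rectangles are used when looking for too many points and open rectangles when looking for too few, which is where the extreme values occur.

```python
def discrepancy(points):
    """Brute-force discrepancy of a point set in the unit square.

    For every axis-parallel rectangle with sides on candidate coordinates,
    compare the number of points inside with area(R) * M.
    """
    m = len(points)
    xs = sorted({0.0, 1.0} | {x for x, _ in points})
    ys = sorted({0.0, 1.0} | {y for _, y in points})
    worst = 0.0
    for i, x_lo in enumerate(xs):
        for x_hi in xs[i:]:
            for j, y_lo in enumerate(ys):
                for y_hi in ys[j:]:
                    expected = (x_hi - x_lo) * (y_hi - y_lo) * m
                    # Closed rectangle: most points for this area.
                    closed = sum(
                        x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in points
                    )
                    # Open rectangle: fewest points for (almost) this area.
                    opened = sum(
                        x_lo < x < x_hi and y_lo < y < y_hi for x, y in points
                    )
                    worst = max(worst, closed - expected, expected - opened)
    return worst


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.2, 0.56), (0.4, 0.22), (0.6, 0.78), (0.8, 0.44)]
    print(discrepancy(sample))
```

The running time grows roughly as the fifth power of the number of points, so this is meant for checking small examples only.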

It is known that any placement scheme must have discrepancy at least
$\Omega(\log M)$ [20] and several placement schemes with discrepancy O(log M) are
known in the literature. The definition of discrepancy can be generalized to arbitrary
dimensions. In d dimensions, the known lower and upper bounds are
$\Omega((\log M)^{(d-1)/2})$ and $O((\log M)^{d-1})$ respectively. These results form the basis of
our upper and lower bound arguments.

3.2 Relationship with Declustering Schemes 

We informally argue that a declustering scheme with small additive error can be 
used to construct a placement scheme with small discrepancy, and vice versa. 

Consider a good declustering scheme on an M × M grid G. Because we are
distributing the $M^2$ grid points among M disks, a good declustering scheme will have
roughly M instances of each disk. Let us focus on just one disk, say the M instances
of disk zero. For any query Q, its response time is defined as the maximum
number of instances of any disk contained within Q. We will approximate the
response time with the number of instances of disk zero contained within Q.
Then the additive error of Q is approximately equal to

(number of instances of disk zero contained within Q) − |Q|/M.

Now suppose we compress the grid into a unit square (so that both x and y
dimensions are compressed by a factor of M) and consider the positions of disk
zero. The original query Q gets compressed into a rectangle R of area $|Q|/M^2$, so
that the discrepancy of this placement scheme with respect to R is the difference
of (number of instances of disk zero contained within R) and area(R) * M. But
the number of instances of disk zero contained within R is equal to the number of
instances of disk zero contained within Q, and area(R) * M is equal to |Q|/M. This
implies that the discrepancy of the placement scheme is equal to the additive
error of the declustering scheme.

The next section describes how to obtain a good declustering scheme from a 
good placement scheme. 

4 From Discrepancy to Declustering 

Our overall strategy can be stated in the following three steps: 

1. Start with a placement scheme $P_0$ in the unit square with M points. By
multiplying x and y dimensions by M, we obtain M points in an M × M
grid. We can think of these as approximate positions of disk zero. We call
this placement P.

2. The positions of disk zero in P may not correspond to grid points (i.e., 
their x or y-coordinates may not be integer). In this step we map these M 
arbitrary points to M grid points. 

Fig. 1. Progress of Steps 1-3 for M = 5, starting with an initial placement scheme
$P_0$ = (0, 0), (0.2, 0.56), (0.4, 0.22), (0.6, 0.78), (0.8, 0.44). $N_x = N_y = M$.



3. The M grid points in the previous step give the instances of disk zero in an
M × M grid. Based on this “template”, we first place all other disks in the
M × M grid and then generalize it to a declustering scheme for an arbitrary
$N_x \times N_y$ grid.

The goal is to be able to start with a placement scheme $P_0$ with small discrepancy,
and still guarantee a small additive error for the resulting declustering
scheme obtained from Steps 1-3 above. Indeed, our construction guarantees that
if we start with a placement scheme $P_0$ with discrepancy k, then the additive
error of the resulting declustering scheme is at most $O(k + k^2/M)$. Thus, picking
any placement scheme $P_0$ with k = O(log M) from the discrepancy theory literature
(e.g. [25,13]), we can construct a declustering scheme with O(log M)
additive error. In the rest of this section, we describe Steps 1-3 in detail, along
with the necessary claims and their proofs.

4.1 Step-1 

Start with a placement scheme $P_0$ on M points for a unit grid and scale up
each dimension by a factor of M. Let the resulting points be $(x_0, y_0), \ldots,
(x_{M-1}, y_{M-1})$. We call this new placement scheme P.

Figures 1 (a) and (b) show an example for M = 5. The resulting place- 
ment P in Figure 1 (b) contains points (0, 0), (1.0, 2.80), (2.0, 1.10), (3.0, 3.90) 
and (4.0,2.20). 

We redefine discrepancy for scheme P, which is imposed on an M × M
grid rather than on a unit grid. Our definition differs in two aspects from the
standard definition. First of all, we have scaled up the area of any rectangle
by a factor of $M^2$ (a scaling of M in each dimension). Second, we allow more
general rectangles for measuring discrepancy: we consider rectangles whose x or
y-coordinates may come from a “wrap-around” interval, i.e., an interval of the
form $[j, j+1, \ldots, M-1, 0, 1, 2, \ldots, k]$. We denote these as “wrap-around” rectangles.
The introduction of the notion of “wrap-around” rectangles is needed when we
generalize the scheme to arbitrary grids in Step-3.

Fig. 2. An example wrap-around rectangle
$g_1 \cup g_2 \cup g_3 \cup g_4$.



Fig. 3. Points of placement P are denoted
by asterisks; points of P' are denoted by
dots. For each point in P, its mapping in
P' is the nearest dot point. In this example,
$|S_{min}| = 3$, $|S_R| = 5$ and $|S_{max}| = 6$.



Pictorially, imagine that the left and right sides of the grid are joined and 
similarly the top and the bottom sides of the grid are joined. Because each “wrap- 
around” interval is a disjoint union of one or two (standard) intervals, a “wrap- 
around” rectangle is a disjoint union of one, two, or four disjoint (standard) 
rectangles in the grid. Pictorially, the last case will correspond to four standard 
rectangles in the four corners of the grid. Figure 2 shows an example. 
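
The decomposition of a wrap-around rectangle into standard rectangles can be sketched in a few lines of Python. The function names and the encoding of an interval as an inclusive (start, end) pair are assumptions made for this illustration.

```python
def split_wrap_interval(start, end, m):
    """Split the cyclic interval start, start+1, ..., end (mod m) into
    one or two standard inclusive intervals."""
    if start <= end:
        return [(start, end)]
    return [(start, m - 1), (0, end)]


def split_wrap_rectangle(x_start, x_end, y_start, y_end, m):
    """Return the one, two, or four standard rectangles, each given as
    ((x_lo, x_hi), (y_lo, y_hi)), whose disjoint union is the wrap-around
    rectangle."""
    return [
        (x_part, y_part)
        for x_part in split_wrap_interval(x_start, x_end, m)
        for y_part in split_wrap_interval(y_start, y_end, m)
    ]


if __name__ == "__main__":
    # Wraps in both dimensions of a 5 x 5 grid: four corner rectangles.
    for part in split_wrap_rectangle(3, 1, 4, 0, 5):
        print(part)
```

When both intervals wrap, the four pieces returned are exactly the corner rectangles described above.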



Definition 1. Fix a placement P of M points in an M × M grid. Then given
any rectangle (which could be a “wrap-around” rectangle) R, the discrepancy of P
with respect to R is defined as

$$\left|\, \text{number of points falling in } R \;-\; \frac{area(R)}{M} \,\right| .$$

The discrepancy of placement P is defined as the highest value of discrepancy
with respect to any (including “wrap-around”) rectangle R.



Lemma 1. The discrepancy of P is at most four times the discrepancy of $P_0$.

Proof. This can be easily verified by observing that (1) the scaling of $P_0$ to P
does not affect the discrepancy value, and (2) the introduction of “wrap-around”
rectangles in P may multiply the discrepancy by at most a factor of four.

4.2 Step-2

We place the point with the smallest x-coordinate in the zeroth column, the
point with the next smallest x-coordinate in the first column and so on. We do 
an analogous thing with y-coordinates and rows. 

Formally the process is: sort $x_0, x_1, x_2, \ldots, x_{M-1}$ in increasing order (break
ties arbitrarily) and let $0 \le w_i < M$ be the rank of $x_i$ within the sorted order.
Similarly sort $y_0, y_1, \ldots, y_{M-1}$ in increasing order (break ties arbitrarily) and let
$0 \le z_i < M$ be the rank of $y_i$ within the sorted order. Then map $(x_i, y_i)$ to the
grid point $(w_i, z_i)$. Let us call the new placement scheme P'. Figure 1 (c) shows
the resulting P' following the same example.
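
Steps 1 and 2 can be traced on the M = 5 example from Figure 1. The Python sketch below is a minimal illustration; the function names are hypothetical, and ties are broken by index order, which is one admissible way of breaking them arbitrarily.

```python
def scale_placement(p0):
    """Step 1: scale a unit-square placement of M points to an M x M grid."""
    m = len(p0)
    return [(x * m, y * m) for x, y in p0]


def ranks(values):
    """Rank of each value (0-based) in increasing order; ties broken by index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, i in enumerate(order):
        rank[i] = position
    return rank


def snap_to_grid(p):
    """Step 2: map each point (x_i, y_i) to the grid point (w_i, z_i)."""
    w = ranks([x for x, _ in p])
    z = ranks([y for _, y in p])
    return list(zip(w, z))


p0 = [(0, 0), (0.2, 0.56), (0.4, 0.22), (0.6, 0.78), (0.8, 0.44)]
p_prime = snap_to_grid(scale_placement(p0))

if __name__ == "__main__":
    print(p_prime)
```

Since the ranks in each coordinate form a permutation of 0, ..., M − 1, every row and every column of the grid receives exactly one point.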

We will use the following claim in Steps 2 and 3. 

Claim. Because the $w_i$'s are all distinct, each column contains exactly one point.
Similarly the $z_i$'s are all distinct, and thus each row contains exactly one point.



Lemma 2. If the discrepancy of scheme P is k then the discrepancy of scheme P'
is at most $5k + 4k^2/M$. (In all declustering schemes we describe later in Section 5,
we will start with an initial placement scheme $P_0$ with discrepancy O(log M), so
that the $4k^2/M$ term will be equal to $O(\log^2 M / M)$, which is vanishingly small.)

Proof. We will first prove that the positions of points in P do not change too
much as we shift them around to obtain P'. We will prove that $|w_i - x_i| \le k$
and $|z_i - y_i| \le k$.

Consider the rectangle R whose four corners are (0, 0), ($x_i$, 0), (0, M), and
($x_i$, M). The area of R is $x_i * M$ and the number of points falling in R is equal
to the number of points whose x-coordinate is less than or equal to $x_i$, which is
equal to the rank of $x_i$ in the sorted ordering of $x_0, x_1, x_2, \ldots, x_{M-1}$, which is
$w_i$. So, since P has discrepancy k, $|w_i - (x_i * M)/M| \le k$, thus $|w_i - x_i| \le k$. By a similar
argument we can prove that $|z_i - y_i| \le k$.

We are ready to prove the lemma. Consider any rectangle R of dimension
c × r. Let $S_R$ denote the set of points that fall inside R under scheme P'. We
are interested in the cardinality of $S_R$.

Consider the rectangle $R_{min}$ of dimension (c − 2k) × (r − 2k) that is obtained
by pushing in each side of R by a distance of k. Also consider the rectangle $R_{max}$
of dimension (c + 2k) × (r + 2k) that is obtained by pushing out each side of R
by a distance of k. Let $S_{min}$ (resp. $S_{max}$) denote the set of points that fall inside
$R_{min}$ (resp. $R_{max}$) under scheme P. Figure 3 shows the situation.

We already showed $|w_i - x_i| \le k$ and $|z_i - y_i| \le k$. Thus it follows that all the
points in $R_{min}$ (under scheme P) must fall in R (under scheme P') and no point
outside $R_{max}$ (under scheme P) can fall in R (under scheme P'). We conclude

$$|S_{min}| \le |S_R| \le |S_{max}|. \qquad (1)$$

Because scheme P has discrepancy k,

$$|S_{min}| \ge \frac{area(R_{min})}{M} - k = \frac{(c-2k)(r-2k)}{M} - k \qquad (2)$$

and

$$|S_{max}| \le \frac{area(R_{max})}{M} + k = \frac{(c+2k)(r+2k)}{M} + k. \qquad (3)$$

From Equations 1, 2, and 3,

$$\frac{(c-2k)(r-2k)}{M} - k \le |S_R| \le \frac{(c+2k)(r+2k)}{M} + k.$$

Discrepancy of P' with respect to R is

$$\left|\, |S_R| - \frac{cr}{M} \,\right| \le \max\left\{ \frac{cr}{M} - \frac{(c-2k)(r-2k)}{M} + k,\;\; \frac{(c+2k)(r+2k)}{M} + k - \frac{cr}{M} \right\} = \frac{2k(c+r) + 4k^2}{M} + k.$$

Since $c, r \le M$, we get that discrepancy of P' is bounded by $5k + 4k^2/M$.



4.3 Step-3 

We know from Claim 4.2 that the placement scheme P' contains exactly one
point in each column of the M × M grid. Let σ(x) be the row index of the point
in column x. Then, given any arbitrary $N_x \times N_y$ grid, the declustering scheme
maps the point (x, y), $0 \le x < N_x$, $0 \le y < N_y$, to disk (y − σ(x mod M)) mod
M. Figure 1(d) shows the disk assignment of the left-bottom M × M subgrid.
Pictorially, the declustering scheme takes this “pattern” and repeats it throughout
the rest of the grid.
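
The mapping in Step 3 is short enough to sketch directly. In the Python illustration below, the function name is hypothetical and the grid placement P' is the one obtained for the M = 5 example of Figure 1.

```python
def make_scheme(p_prime):
    """Step 3: build the disk assignment (y - sigma(x mod M)) mod M from a
    placement P' that has exactly one point in each column."""
    m = len(p_prime)
    sigma = [0] * m
    for column, row in p_prime:
        sigma[column] = row  # row index of the single point in this column

    def disk(x, y):
        return (y - sigma[x % m]) % m

    return disk


# P' for the M = 5 example: one point per row and per column.
p_prime = [(0, 0), (1, 3), (2, 1), (3, 4), (4, 2)]
disk = make_scheme(p_prime)

if __name__ == "__main__":
    # Print the bottom-left 5 x 5 subgrid, top row first.
    for y in reversed(range(5)):
        print(" ".join(str(disk(x, y)) for x in range(5)))
```

By construction the points of P' are exactly the instances of disk zero, and because σ is a permutation, each disk appears once in every row and every column of the M × M subgrid.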

Theorem 1. If we start with a placement scheme $P_0$ with discrepancy k then
the resulting declustering scheme D' will have additive error at most $O(k + k^2/M)$.

This implies that by starting with a placement scheme with O(log M) discrepancy,
we can construct a declustering scheme with O(log M) additive error.

We omit the proof (it can be found at www.cs.umd.edu/users/randeep/disc.ps)
because of a lack of space. In Section 6, we state a converse of this
theorem (Theorems 2, 3).



Higher-dimensional Extensions. There are several possible ways of extending
the technique described above to obtain higher-dim declustering schemes. We
can start with a higher-dimensional placement scheme with good discrepancy to
obtain a higher-dimensional declustering scheme. Alternatively, we can start with
a 2-dim declustering scheme (obtained using the technique outlined in the previous
section) and generalize it to higher dimensions, using a recursive technique as
described in [6,7].



5 Description of Declustering Schemes 

In this section, we present several 2-dim declustering schemes with O(log M)
additive error. We describe more than one scheme because users may
have other constraints (besides trying to minimize worst case response time) and
some of these schemes may be better suited than the others.

5.1 Corput’s Scheme

The first placement scheme we consider is given by Van der Corput [25]. The M
points given by this scheme are

$$\{\, (i/M,\ key_i) : 0 \le i < M \,\},$$

where $key_i$ is computed as follows: Let $a_{k-1} \ldots a_1 a_0$ be the binary representation
of i, where $a_0$ is the least significant bit. Then $key_i = \frac{a_0}{2} + \frac{a_1}{2^2} + \frac{a_2}{2^3} + \cdots + \frac{a_{k-1}}{2^k}$.

Now we apply the three steps outlined in the previous section. In Step 1, we
multiply each co-ordinate by M to obtain the set $\{(i, M * key_i) : 0 \le i < M\}$. In
Step 2, we need to map these points to integer co-ordinates. The x-coordinate
is already an integer. The y-coordinate $M * key_i$ gets mapped to the rank of
$M * key_i$ in the set $\{M * key_i : 0 \le i < M\}$. We observe that this is equal to
the rank of $key_i$ in the set $\{key_i : 0 \le i < M\}$. Let RANK(i) denote the rank
of this element. Then step 3 dictates that the point (x, y) should map to disk
(y − RANK(x mod M)) mod M.

We summarize these steps below. 

step-1: construct M pairs $(i, key_i)$ for $0 \le i < M$, where $key_i$ is computed
as follows: Let $a_{k-1} \ldots a_1 a_0$ be the binary representation of i, where $a_0$ is the
least significant bit. Then $key_i = \frac{a_0}{2} + \frac{a_1}{2^2} + \frac{a_2}{2^3} + \cdots + \frac{a_{k-1}}{2^k}$.

step-2: sort the first components based on key values. This will give a permutation
on 0, 1, ..., M − 1. Call the resulting permutation PERM(M). Compute
the inverse permutation, RANK, by

for i = 0 to M - 1 { RANK(PERM(i)) = i }

step-3: map point (x, y) to disk (y − RANK(x mod M)) mod M.
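
The three steps translate almost line for line into Python. The sketch below is illustrative; the function names are hypothetical, and the keys are computed in floating point, which is exact here because every key is a finite binary fraction.

```python
def corput_key(i):
    """key_i = a_0/2 + a_1/4 + ... for the binary digits a_0, a_1, ... of i."""
    key, weight = 0.0, 0.5
    while i:
        key += (i & 1) * weight  # a_0 is the least significant bit
        i >>= 1
        weight /= 2
    return key


def corput_rank(m):
    """Step 2: sort indices by key (PERM) and invert the permutation (RANK)."""
    perm = sorted(range(m), key=corput_key)
    rank = [0] * m
    for position, i in enumerate(perm):
        rank[i] = position  # RANK(PERM(position)) = position
    return rank


def corput_disk(x, y, rank):
    """Step 3: map point (x, y) to disk (y - RANK(x mod M)) mod M."""
    m = len(rank)
    return (y - rank[x % m]) % m


if __name__ == "__main__":
    rank = corput_rank(8)
    print(rank)
    print(corput_disk(3, 5, rank))
```

When M is a power of two, the RANK table is simply the bit-reversal permutation of 0, ..., M − 1; for other values of M the sort in step 2 is what makes the table a valid permutation.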

The next two schemes, GRS and Faure’s scheme, are constructed in an ana- 
logous manner, except that they start with a different initial placement. Rather 
than describing the steps from discrepancy to declustering scheme, we present 
the declustering schemes directly. 



5.2 GRS Scheme 

The GRS scheme was first described by us in [6] and we proved that whenever 
M is a Fibonacci number, the response time of any query is at most three times 
its optimal response time. Our proof was in terms of “gaps” of any permutation 
and worked only when M was a Fibonacci number. 

It turns out that the same scheme can be obtained from a placement scheme 
with discrepancy O(logM) [16] [20, page 80, exercise 3] (described below). Thus 
Theorem 1 implies that the additive error of GRS scheme is O(logM) for any 
M. This is further evidence of the generality and power of Theorem 1.

step-1: construct M pairs $(i, key_i)$ for $0 \le i < M$, where $key_i$ is the fractional
part of

Step-2 and step-3 are the same as those described in Section 5.1. 

5.3 Faure’s Scheme

The following scheme is based on Faure’s placement scheme [14]. This scheme has 
two parameters: a base b and a permutation σ on {0, 1, ..., b − 1}. For suitable
choice of the parameters, this is the best known construction (in terms of the 
constant factor in the discrepancy bound). 

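step-1: construct M pairs $(i, key_i)$ for $0 \le i < M$, where $key_i$ is computed
as follows: Let $a_{k-1} \ldots a_1 a_0$ be the representation of i in base b (i.e. $i = \sum_{j=0}^{k-1} a_j b^j$),
where $0 \le a_j \le b - 1$ for all j and $a_0$ is the least significant digit.

Then $key_i = \frac{\sigma(a_0)}{b} + \frac{\sigma(a_1)}{b^2} + \frac{\sigma(a_2)}{b^3} + \cdots + \frac{\sigma(a_{k-1})}{b^k}$.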

Step-2 and step-3 are the same as in Section 5.1. 

Please note that the CORPUT scheme is a special case with b = 2 and a 
being the identity permutation. 
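
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `faure_key`, `faure_rank` and `disk_of` are hypothetical, exact rational arithmetic is used so that keys compare reliably, and $\sigma(0) = 0$ is assumed (as in the permutations used in Section 7) so that leading zero digits do not change a key.

```python
from fractions import Fraction

def faure_key(i, b, sigma):
    """Permuted radical inverse of i in base b: sum of sigma(a_j) / b^(j+1).

    Assumes sigma[0] == 0, so digits above the most significant one add nothing.
    """
    key, scale = Fraction(0), Fraction(1, b)
    while i > 0:
        i, digit = divmod(i, b)      # peel off the least significant digit a_j
        key += sigma[digit] * scale  # contributes sigma(a_j) / b^(j+1)
        scale /= b
    return key

def faure_rank(M, b, sigma):
    """Steps 1 and 2: sort 0..M-1 by key to get PERM, then invert it into RANK."""
    perm = sorted(range(M), key=lambda i: faure_key(i, b, sigma))
    rank = [0] * M
    for position, i in enumerate(perm):
        rank[i] = position           # RANK(PERM(position)) = position
    return rank

def disk_of(x, y, rank):
    """Step 3: grid point (x, y) goes to disk (y - RANK(x mod M)) mod M."""
    M = len(rank)
    return (y - rank[x % M]) % M

if __name__ == "__main__":
    # van der Corput special case: b = 2, sigma = identity, M = 4 disks.
    rank = faure_rank(4, 2, [0, 1])
    for y in range(4):
        print([disk_of(x, y, rank) for x in range(4)])
```

Because RANK is a permutation, every row and every column of an M × M tile contains each disk exactly once; the choice of keys only determines how evenly the disks are spread over larger rectangles.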



5.4 Generalized Hierarchical Scheme 

The Hierarchical Scheme is presented in [7]. It is based on a technique for 
constructing a declustering scheme for $M = m_1 \times m_2 \times \cdots \times m_k$ disks, given 
declustering schemes $D_i$ for $m_i$ disks, $1 \le i \le k$. Note that $m_i$ may be the 
same as $m_j$ for $i \ne j$. The idea is that using strictly optimal declustering 
schemes for all M < p for some prime p one can construct good declustering 
schemes for any M that can be expressed as a product of the first p prime 
numbers. The hierarchical declustering scheme, for a fixed p, is therefore only 
defined for those M which can be expressed as a product of the first p prime 
numbers. Due to space limits, we refer the reader to [7] for a detailed 
description of hierarchical schemes. 

We now extend the hierarchical scheme to any number of disks. The original 
hierarchical declustering scheme is a column permutation declustering scheme 
(CPDS). A CPDS is defined by a function $F : \{0, \ldots, M-1\} \to \{0, \ldots, M-1\}$ 
such that the grid point $(x, F(x))$, $x = 0, 1, \ldots, M-1$, is assigned to disk 
zero and, in general, point $(x, y)$ is assigned to disk $(y - F(x \bmod M)) \bmod M$. 
We have seen one such scheme earlier: the declustering scheme D in Figure 1(d) 
is a CPDS. Given a CPDS which is only defined for certain values of M, we 
extend it to all values of M as follows. Given $M = n$ for which the CPDS is 
not defined, we find the smallest number $n' > n$ such that the CPDS is defined 
for $M = n'$. Consider an $n' \times n'$ grid G under the CPDS. Again, we restrict 
our attention to disk zero in this grid. We construct a CPDS for $M = n$ as 
follows. Take the first n columns of G. Note that there are n instances of disk 
zero in these columns. We assign each of these instances a unique rank 
(essentially sort them, breaking ties in any way) between 0 and $n - 1$, based on 
their row positions (y-coordinates). Let the rank of disk zero in column i be 
$r_i$. Define a function $F'(i) = r_i$. The CPDS defined by the function $F'$ is the 
CPDS for $M = n$. 

We call this the generalized hierarchical declustering scheme. We can show 
that the generalized hierarchical declustering scheme has an O(log M) additive 
error (proof omitted). 
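
The extension step can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, a CPDS is represented simply as the list of values F(0), ..., F(n' − 1), and ties are broken by column index, which is one of the arbitrary choices the text permits.

```python
def extend_cpds(f_big, n):
    """Derive F' for n disks from a CPDS function F defined for n' >= n disks.

    f_big[x] is the row of disk zero in column x of the n' x n' grid.
    The first n columns are kept and their disk-zero rows are replaced
    by their ranks 0..n-1 (ties broken by column index).
    """
    if n > len(f_big):
        raise ValueError("the source scheme must have at least n columns")
    rows = f_big[:n]
    order = sorted(range(n), key=lambda col: (rows[col], col))
    f_small = [0] * n
    for rank, col in enumerate(order):
        f_small[col] = rank          # F'(col) = rank of disk zero in that column
    return f_small

def cpds_disk(f, x, y):
    """Disk of grid point (x, y) under the CPDS given by f."""
    M = len(f)
    return (y - f[x % M]) % M

if __name__ == "__main__":
    f5 = [0, 3, 1, 4, 2]             # some CPDS for n' = 5 (example values only)
    f3 = extend_cpds(f5, 3)          # derived CPDS for n = 3
    print(f3, [[cpds_disk(f3, x, y) for x in range(3)] for y in range(3)])
```

Ranking preserves the relative vertical order of the disk-zero positions, which is the property the construction relies on.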

6 Lower Bound 

An ideal declustering scheme is one whose performance is strictly optimal on 
all range queries. An interesting question is whether any realizable declustering 
scheme is ideal, and if not, how close any declustering scheme can get to the 
ideal scheme. Abdel-Ghaffar and Abbadi [2] showed that except for a few stringent 
cases, the additive error of any 2-dim scheme is at least one. We strengthen their 
result to give an (asymptotically) tight lower bound of $\Omega(\log M)$ in 2-dim and a 
lower bound of $\Omega((\log M)^{(d-1)/2})$ for any d-dim scheme, thus proving that the 
2-dim schemes described in this paper are (asymptotically) optimal. 

Theorem 2. Given any 2-dim declustering scheme D for M disks and any M × M 
grid G, there exists a query Q in the grid G such that for query Q, 
$RT(Q) - ORT(Q) = \Omega(\log M)$. In other words, for any 2-dim declustering scheme, 
there are queries on which the response time is at least $\Omega(\log M)$ more than 
the optimal response time. 

Theorem 3. Given any d-dim declustering scheme D for M disks and any d-dim 
grid G with all side lengths M, there exists a query Q in the grid G such 
that for query Q, $RT(Q) - ORT(Q) = \Omega((\log M)^{(d-1)/2})$. In other words, for any 
d-dim declustering scheme, there are queries on which the response time is at 
least $\Omega((\log M)^{(d-1)/2})$ more than the optimal response time. (The actual lower 
bound is slightly larger, based on the discrepancy lower bound of [4].) 

We will sketch a proof of Theorem 2. Proof of Theorem 3 requires several 
additional tricks and is omitted. 

Proof. Let D be a declustering scheme and G be an M × M grid as in the 
statement of the theorem. Since the $M^2$ grid points in G are mapped to the M 
disks {0, ..., M − 1}, there must exist a disk $i \in \{0, \ldots, M-1\}$ such that there are 
$n \ge M$ instances of disk i in G. W.l.o.g. we assume i = 0. Let us remove 
all disks except disk 0 from G. Let us also remove n − M instances of disk 0 
from G, thus leaving exactly M points (disks) in G. We will denote by p(Q) the 
number of points contained in a rectangular query Q. We will show that there 
is a query Q such that $p(Q) - ORT(Q) = \Omega(\log M)$. Because there are at least 
p(Q) instances of disk 0 in Q under the declustering scheme D, this will imply 
$RT(Q) - ORT(Q) \ge p(Q) - ORT(Q) = \Omega(\log M)$. 

Our proof strategy is as follows. We obtain a placement scheme from the 
positions of the M points in G. It is known that any placement scheme has 
discrepancy $\Omega(\log M)$ with respect to at least one rectangle R. We want to use 
R to construct the query Q with large additive error. There are two problems 
with this simple plan. The first problem is that the boundary of R may not be 
aligned with the grid lines. This is fixed by taking a slightly smaller rectangle 
whose boundary lies on the grid lines and arguing that this new rectangle also 
has a high discrepancy. The second problem is more serious. Remember that 
discrepancy is defined as the absolute difference of the expected and actual number 
of points. So it is possible that R may be receiving fewer points than expected. 
In this case, we can only claim that the corresponding query is receiving fewer 
(than OPT) instances of disk zero, which is not enough to prove a large additive 
error. The way around this is to observe that the grid G as a whole has zero 
discrepancy. So if R receives fewer points than expected, some other rectangle 
must be receiving more points than expected, and we construct a query from that 
rectangle. 

Table 1. Additive errors of various schemes. Each entry in the table indicates the 
maximum number of disks for which the corresponding additive error is guaranteed. 

additive error =     0     1     2     3     4     5 
Corput               3     8    34   130   273   470 
Faure-b5             3    10    27   106   140   275 
Faure-b9             3     9    45    90   414   538 
Faure-b36            3    11    36   109   306   368 
GRS                  3    22    94   391   553     - 
Hierarchical         3    11    45    95   200     - 



7 Simulation Results 

We present simulation results that compare the actual additive errors of the 
various schemes described in Section 5. For Faure’s scheme, we tried 
three variations: $b = 5$, $\sigma = 0, 3, 2, 1, 4$; $b = 9$, $\sigma = 0, 5, 2, 7, 4, 1, 6, 3, 8$; and 
$b = 36$, $\sigma = 0, 25, 17, 7, 31, 11, 20, 3, 27, 13, 34, 22, 5, 15, 29, 9, 23, 1, 18, 32, 8, 28, 14, 4, 
21, 33, 12, 26, 2, 19, 10, 30, 6, 16, 24, 35$. The generalized hierarchical scheme is 
constructed solely from the three optimal schemes for M = 2, 3 and 5. 

We compute the additive error of each scheme for a large range of M. For a 
fixed M, it is known [6,7] that in order to compute the additive error of any 
permutation declustering scheme, it is enough to consider all possible queries, 
including “wrap-around” queries, in an M × M grid. We vary M from two to a few 
hundred. Because the simulation is time consuming, the results we present here 
are what could be obtained in a reasonable time (the longest runs took a week). 
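
A brute-force version of this computation can be sketched as follows. This is a hedged illustration, not the simulation code used for the paper: the function name `additive_error` is hypothetical, the scheme is given as the list F(0), ..., F(M − 1) of a column permutation scheme, the optimal response time of a query covering w·h points is taken to be ⌈w·h / M⌉, and no attempt is made to speed up the enumeration.

```python
def additive_error(f):
    """Worst-case RT(Q) - ORT(Q) over all wrap-around queries in an M x M grid.

    f[x] is the row of disk zero in column x; point (x, y) is stored on
    disk (y - f[x]) mod M.  Every anchor and every width/height in 1..M
    is tried, so the running time grows roughly like M^6.
    """
    M = len(f)
    disk = [[(y - f[x]) % M for y in range(M)] for x in range(M)]
    worst = 0
    for x0 in range(M):
        for y0 in range(M):
            for w in range(1, M + 1):
                for h in range(1, M + 1):
                    counts = [0] * M
                    for dx in range(w):
                        column = disk[(x0 + dx) % M]
                        for dy in range(h):
                            counts[column[(y0 + dy) % M]] += 1
                    response_time = max(counts)          # busiest disk
                    optimal = -(-(w * h) // M)           # ceil(w*h / M)
                    worst = max(worst, response_time - optimal)
    return worst

if __name__ == "__main__":
    print(additive_error([0, 1, 2]))      # diagonal scheme on 3 disks
    print(additive_error([0, 0, 0, 0]))   # degenerate scheme: whole rows on one disk
```

The degenerate scheme in the second call puts an entire row on a single disk, so a one-row query of width M waits for M blocks from the same disk while the optimum is one.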

We present the results in a tabular format as shown in Table 1. The numbers 
in the top row represent additive errors, ranging from 0 to 5. The number in each 
of the table cells represents the maximum number of disks for which the 
corresponding additive error is guaranteed. For example, Corput’s scheme 
guarantees that when M ≤ 3 the additive error is zero; when M ≤ 8 its additive 
error is at most 1; when M ≤ 34, the additive error is at most 2, etc. 

The table shows that all schemes provide very good worst case performance: 
the deviation from the ideal scheme is at most 5 for a large range of M. The 
most notable is GRS, which guarantees an additive error of at most 4 for M ≤ 553. 

^ Given M, it takes 0(M®) to compute the additive error of a scheme. For some sche- 
mes, we manage to reduce the complexity to 0(M®) or O(M^) by taking advantage 
of a dynamic programming technique and some special properties of the schemes. 
Note that this does not mean that GRS is better than all other schemes for all 
values of M. A better strategy is to use a hybrid scheme: given M, select the 
scheme with the lowest additive error. 

Given the small additive errors of these schemes, we feel that other 
performance metrics, such as average additive error and ratio to the optimal, are 
of less importance. An additive error within 5 translates into a difference of 
less than 50 milliseconds in practice (assuming 10 ms disk seek time). Taking into 
account seek time variation, the response time is already optimal in a statistical 
sense. Nonetheless, we leave it to the users to select from these schemes the one 
that best fits their requirements (for example, in a multiuser environment the 
average response time may be more important). 

8 Conclusion 

Declustering is a popular technique to speed up bulk retrieval of 
multidimensional data. This paper focuses on range queries for uniform data. Even 
though this is a very well-studied problem, none of the earlier proposed schemes 
has provably good behavior. We measure the additive error of any declustering 
scheme as its worst case additive deviation from the ideal scheme. In this paper, 
for the first time, we describe a number of 2-dimensional schemes with additive 
error O(log M) for all values of M. We also present brute-force simulation 
results to show that the exact (not asymptotic) additive error is quite small for a 
large range of number of disks. We prove that this is the best possible 
analytical bound by giving a matching lower bound of $\Omega(\log M)$ on the 
performance of any 2-dimensional declustering scheme. We generalize this lower 
bound to $\Omega((\log M)^{(d-1)/2})$ for d-dimensional schemes. 

Our main technical contribution is a connection between the declustering 
problem and discrepancy theory, a well studied sub-discipline of Combinatorics. We 
give a general technique for mapping any good discrepancy placement scheme 
into a good declustering scheme. Using this technique, we construct new 
declustering schemes built upon van der Corput’s and Faure’s discrepancy 
placements. We note that there exist many more sophisticated discrepancy schemes 
that are known to have good discrepancy in practice (e.g., the Net-based schemes 
[24, 20]). Whether they will result in better declustering schemes (than Corput’s 
and Faure’s schemes) is an interesting question that requires more experiments to 
answer. 

Abstract. Applications like Online Analytical Processing depend heavily 
on the ability to quickly summarize large amounts of information. 
Techniques were proposed recently that speed up aggregate range que- 
ries on MOLAP data cubes by storing pre-computed aggregates. These 
approaches try to handle data cubes of any dimensionality by dealing 
with all dimensions at the same time and treat the different dimensions 
uniformly. The algorithms are typically complex, and it is difficult to 
prove their correctness and to analyze their performance. We present a 
new technique to generate Iterative Data Cubes (IDC) that addresses 
these problems. The proposed approach provides a modular framework 
for combining one-dimensional aggregation techniques to create space- 
optimal high-dimensional data cubes. A large variety of cost tradeoffs 
for high-dimensional IDC can be generated, making it easy to find the 
right configuration based on the application requirements. 



1 Introduction 

Data cubes are used in Online Analytical Processing (OLAP) [4] to support 
the interactive analysis of large data sets, e.g., as stored in data warehouses. 
Consider a data set where each data item has d functional attributes and a 
measure attribute. The functional attributes constitute the dimensions of a d- 
dimensional hyper-rectangle, the data cube. A cell of the data cube is defined 
by a unique combination of dimension values and stores the corresponding value 
of the measure attribute. An example of a data cube defined for a view on 
the TPC-H benchmark database [19] might have the total price of an order as 
the measure attribute and the region of a customer and the order date as the 
dimensions. It provides the aggregated total orders for all combinations of regions 
and dates. Queries issued by an analyst who wants to examine how the customer 
behavior in different regions changes over time do not need to access and join the 
“raw” data in the different tables. Instead the information is readily available 
and hence can be aggregated and summarized from the data cube. Our work 
focuses on Multidimensional OLAP (MOLAP) systems [14] where data cubes 
are represented in terms of multidimensional arrays (e.g., dense data cubes). 

An aggregate range query selects a hyper-rectangular region of the data cube 
and computes the aggregate of the values of the cells in this region. For interac- 
tive analysis it is mandatory to provide fast replies for these queries, no matter 
how large the selected region. To achieve this, aggregate values for regions of the 
data cube are pre-computed and stored to reduce on-the-fly aggregation costs. 
We will refer to a data cube that contains such pre-computed values as a 
pre-aggregated data cube. Whenever necessary, the term original data cube is used 
for a cube without such pre-computed aggregates (i.e., which is obtained directly 
from the data set). Note that pre-computation increases update costs since an 
update to a single cell of the original data cube has to be propagated to all cells 
in the pre-computed data cube that depend on the updated value. Also, storing 
additional values increases the storage cost. The choice of the query-update- 
storage cost tradeoff depends on the application. While “what-if” scenarios and 
stock trading applications require fast updates, for other applications overnight 
batch processing of updates suffices. But even batch processing poses limits on 
the update cost which depend on the frequency of updates and the tolerated 
period of inaccessibility of the data. 

In this paper space-optimal techniques for MOLAP systems are explored, 
i.e., Iterative Data Cubes are generated by replacing values of the original data 
cube with pre-computed aggregates. The space-optimality argument would not 
apply to sparse data cubes where empty cells are not stored (e.g., Relational 
OLAP [14]). The main contributions of Iterative Data Cubes are: 

1. For each dimension a different one-dimensional technique for pre-computing 
aggregate values can be selected. Thus specific properties of a dimension, 
e.g., hierarchies and domain sizes, can be taken into account. 

2. Combining the one-dimensional techniques is easy. This greatly simplifies 
developing, implementing and analyzing IDCs. In contrast to previous ap- 
proaches, dealing with a high-dimensional IDC is as simple as dealing with 
the one-dimensional case. 

3. IDCs offer a greater variety of cost tradeoffs between queries and updates 
than any previous technique and cause no space overhead. 

4. They generalize some of the previous approaches, thus providing a new fra- 
mework for comparing and analyzing them. For the other known techniques 
we show analytically that our approach at least matches their query-update 
performance tradeoffs. 

In Sect. 2 related work is presented. The Iterative Data Cube technique is 
described in Sect. 3. There algorithms for querying and updating Iterative Data 
Cubes are discussed as well. Section 4 contains examples for one-dimensional 
pre-aggregation techniques and illustrates how those techniques can be used for 
an application. In Sect. 5 we discuss how IDC performs compared to the previous 
approaches. Section 6 concludes this paper. 

2 Related Work 

An elegant algorithm for pre-aggregation on MOLAP data cubes is presented 
in [11]. We refer to it as the Prefix Sum technique (PS). The essential idea is 
to store pre-computed aggregate information so that range queries are answered 
in constant time (i.e., independent of the selected ranges). This kind of 
pre-aggregation results in high update costs. In the worst case, an update to a 
single cell of the original data cube requires recomputing the whole PS cube. The 
Relative Prefix Sum technique (RPS) [6] reduces the high update costs of PS, while 
still guaranteeing a constant query cost. RPS is improved by the Space-Efficient 
Relative Prefix Sum (SRPS) [17], which guarantees the same query and update 
costs as RPS but uses less space. For dynamic environments Geffner et al. 
proposed the Dynamic Data Cube (DDC) [5], which balances query and update costs 
such that both are provably poly-logarithmic in the domain size of the 
dimensions for any data cube. DDC causes a space overhead which is removed by the 
Space-Efficient Dynamic Data Cube (SDDC) [17]. SDDC improves on DDC by 
reducing the storage costs, while at the same time providing lower or equal costs 
for both queries and updates. The Hierarchical Cubes techniques (HC) [3] 
generalize the idea of RPS and SRPS by allowing different tradeoffs between update 
and query cost. Two different schemes are proposed: Hierarchical Rectangle 
Cubes (HRC) and Hierarchical Band Cubes (HBC). 

The above techniques are the ones that are most related to IDC. They ex- 
plore query-update cost tradeoffs at no extra storage space (except RPS and 
DDC, which were replaced with the space-efficient SRPS and SDDC) for MO- 
LAP data cubes. Like IDC they are only applicable when the aggregate operator 
is invertible (e.g., SUM) or can be expressed with invertible operators (e.g., AVG 
(average)). Iterative Data Cubes generalize PS, SRPS, and SDDC. For Hierar- 
chical Cubes we show that no better query-update cost tradeoffs than for IDC 
can be obtained. Note that all of the above techniques, except PS, are difficult 
to analyze when the data cube has more than one dimension. For instance, the 
cost formulas for the Hierarchical Cubes are so complex, that they have to be 
evaluated experimentally in order to find the “best suited” HC for an application. 

In [7] a new SQL operator, CUBE or “data cube”, was proposed to support 
online aggregation by pre-computing query results for queries that involve grou- 
ping operations (GROUP BY). Our notion of a data cube is slightly different from 
the terminology in [7]. More precisely, the cuboids generated by CUBE (i.e., the 
results of grouping the data by subsets of the dimensions) are data cubes as 
defined in this paper. The introduction of the CUBE operator generated a sig- 
nificant level of interest in techniques for efficient computation and support of 
this operator [1,2,8,9,10,12,13]. These techniques do not concentrate on efficient 
range queries, but rather on which cuboids to pre-compute and how to efficiently 
access them (e.g., using index structures). Since our technique can be applied 
to any cuboid which is dense enough to be stored as a multidimensional array, 
IDC complements research regarding the CUBE operator. For instance, by adap- 
ting the formulas for query and update costs, support for range queries can be 
included into the framework for selecting “optimal” cuboids to be materialized. 
The fact that Iterative Data Cubes are easy to analyze greatly simplifies this 
process. 

Smith et al. [18] develop a framework for decomposing the result of the CUBE 
operator into view elements. Based on that framework algorithms are developed 







M. Riedewald, D. Agrawal, and A. El Abbadi 



that for a given population of queries select the optimal non-redundant set of 
view elements that minimizes the query cost. An Iterative Data Cube has pro- 
perties similar to a non-redundant set of view elements. It contains aggregates 
for regions of the original data cube, does not introduce space overhead, and 
allows the reconstruction of the values of the cells of the original data cube. Ho- 
wever, in contrast to [18] the goal of IDC is to support all possible range queries 
in order to provide provably good worst case or average query and update costs. 

Vitter et al. [20,21] propose approximating data cubes using the wavelet 
transform. While [20] explicitly deals with the aspect of sparseness (which is not 
addressed in this paper) [21], like IDC, targets MOLAP data cubes. Wavelets 
offer a compact representation of the data cube on multiple levels of resolution. 
This makes them particularly suited for returning fast approximate answers. 
Using wavelets to encode the original data cube, however, increases the update 
costs and does not result in a better worst case performance when exact results 
are required. While [21] proposes encoding the pre-aggregated data cube which 
is used for the PS technique, any pre-aggregated (or the original) data cube 
can be encoded using wavelets. In that sense wavelet transform and IDC are 
orthogonal techniques^. Once an appropriate Iterative Data Cube is selected, 
approximate answers to queries can be supported by encoding this IDC using 
wavelet transform. 



3 The Iterative Data Cubes Technique 

In this paper we focus on techniques for MOLAP data cubes that are handled 
similarly to multidimensional arrays. The query cost is measured in terms of the 
number of cells that need to be accessed in order to answer the query. Similarly, 
the update cost is measured as the number of cells of the pre-aggregated data 
cube whose values must be updated to reflect a single update on the data set. 
Since the data cubes are stored and accessed using multidimensional arrays, this 
cost model is realistic for both internal (main memory) and external (disk, tape) 
algorithms. 

In general the IDC technique can be applied to an attribute whose domain 
forms an Abelian group under the aggregate operator. Stated differently, it can 
be applied to an aggregate operator $\oplus$ if there exists an inverse operator 
$\ominus$ such that for all attribute values a and b it holds that 
$(a \oplus b) \ominus b = a$ (e.g., COUNT, but also AVG when expressed with 
“invertible” operators SUM and COUNT). For the sake of simplicity, the technique 
is described for the aggregate operator SUM and a measure attribute whose domain 
is the set of integers. 



^ Note, however, that wavelet encoding typically increases the update cost. 



3.1 Notation 

Let A be a data cube of dimensionality d, and let without loss of generality 
the domain of each dimension attribute $\delta_i$ be $\{0, 1, \ldots, n_i - 1\}$. A cell 
$c = [c_1, \ldots, c_d]$, where each $c_i$ is an element of the domain of the corresponding 
dimension, contains the measure value $A[c]$. With $e : f$ we denote a region of 
the data cube, more precisely the set of all cells c that satisfy $e_i \le c_i \le f_i$ for 
all $1 \le i \le d$ (i.e., $e : f$ is a hyper-rectangular region of the data cube). Cell e 
is the anchor and cell f the endpoint of the region. The anchor and endpoint 
of the entire data cube are $[0, \ldots, 0]$ and $[n_1 - 1, \ldots, n_d - 1]$, respectively. The 
term op(A[e] : A[f]) denotes the result of applying the aggregate operator op to 
the values in region $e : f$. Consequently, SUM(A[e] : A[f]) is a range sum. The 
range sum SUM(A[0, ..., 0] : A[f]) will be referred to as a prefix sum. 



3.2 Creating Iterative Data Cubes 

Iterative Data Cubes are constructed by applying one-dimensional 
pre-aggregation techniques along the dimensions. To illustrate this process, it is 
first described for one-dimensional data cubes and then generalized. Let $\Theta$ be a 
one-dimensional pre-aggregation technique and A be the original one-dimensional 
data cube with n cells. Technique $\Theta$ generates a pre-aggregated array $A_\Theta$ of 
size n, such that each cell of $A_\Theta$ stores a linear combination of the cells of A: 

$$\forall\, 0 \le j \le n-1 : \quad A_\Theta[j] = \sum_{k=0}^{n-1} \alpha_{j,k} A[k] . \qquad (1)$$

The variables $\alpha_{j,k}$ are real numbers that are determined by the pre-aggregation 
technique. Figure 1 shows an example. The array SRPS is the result of 
applying the SRPS technique with block size 3 to the original array A. SRPS 
pre-aggregates a one-dimensional array as follows. A is partitioned into blocks of 
equal size. The anchor of a block a : e (its leftmost cell) contains the 
corresponding prefix sum of A, i.e., SRPS[a] = SUM(A[0] : A[a]). Any other cell c 
of the block stores the “local prefix sum” SRPS[c] = SUM(A[a+1] : A[c]). 
Consequently, the coefficients in the example are $\alpha_{0,0} = \alpha_{1,1} = \alpha_{2,1} = \alpha_{2,2} = 1$, 
$\alpha_{3,k} = 1$ for $0 \le k \le 3$, $\alpha_{4,4} = 1$, $\alpha_{5,4} = \alpha_{5,5} = 1$, $\alpha_{6,k} = 1$ for $0 \le k \le 6$, 
$\alpha_{7,7} = 1$, $\alpha_{8,7} = \alpha_{8,8} = 1$, and $\alpha_{j,k} = 0$ for all other combinations of j and k. 



Original array A 
  | 3 | 5 | 1 | 2 | 2 | 4 | 6 | 3 | 3 | 
  Query: 1 + 2 + 2 + 4 = 9 
  Update: A[4] = 3 
  | 3 | 5 | 1 | 2 | 3 | 4 | 6 | 3 | 3 | 

SRPS array 
  | 3 | 5 | 6 | 11 | 2 | 6 | 23 | 3 | 6 | 
  Query: (11 + 6) − (3 + 5) = 9 
  Update: A[4] = 3 
  | 3 | 5 | 6 | 11 | 3 | 7 | 24 | 3 | 6 | 

PS array 
  | 3 | 8 | 9 | 11 | 13 | 17 | 23 | 26 | 29 | 
  Query: 17 − 8 = 9 
  Update: A[4] = 3 
  | 3 | 8 | 9 | 11 | 14 | 18 | 24 | 27 | 30 | 

Fig. 1. Original array A and corresponding SRPS (block size 3) and PS arrays (query 
range and updated cell are framed, accessed cells are shaded) 
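
The one-dimensional SRPS and PS arrays of Fig. 1 can be reproduced with a short Python sketch. The function names `prefix_sum` and `srps` are chosen here for illustration; the code follows the description above (block anchors store the global prefix sum, all other cells store the local prefix sum that starts just after the anchor).

```python
from itertools import accumulate

def prefix_sum(a):
    """PS technique: cell j stores SUM(A[0] : A[j])."""
    return list(accumulate(a))

def srps(a, block):
    """SRPS technique with the given block size.

    The anchor (leftmost cell) of each block stores the prefix sum of A
    up to and including the anchor; every other cell c of the block
    stores SUM(A[anchor+1] : A[c]).
    """
    out = []
    running = 0   # prefix sum of A up to the current cell
    local = 0     # sum of the cells after the current block's anchor
    for j, value in enumerate(a):
        running += value
        if j % block == 0:
            out.append(running)
            local = 0
        else:
            local += value
            out.append(local)
    return out

if __name__ == "__main__":
    A = [3, 5, 1, 2, 2, 4, 6, 3, 3]
    print(srps(A, 3))       # [3, 5, 6, 11, 2, 6, 23, 3, 6]
    print(prefix_sum(A))    # [3, 8, 9, 11, 13, 17, 23, 26, 29]
```

Setting A[4] = 3 and recomputing gives the updated arrays shown in the figure: only three SRPS cells change, whereas every PS cell from index 4 onwards changes.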
For a two-dimensional data cube A two (possibly different) one-dimensional 
pre-aggregation techniques $\Theta_1$ and $\Theta_2$ are selected. $\Theta_1$ is first applied along 
dimension $\delta_1$, i.e., each row of A is pre-aggregated as described above. Let $A_1$ 
denote the resulting pre-aggregated data cube. The columns of $A_1$ are then 
processed using technique $\Theta_2$, returning the final pre-aggregated data cube $A_2$. 
Figure 2 shows an example. For both dimensions the SRPS technique with block 
size 3 was selected. Note that applying the two-dimensional SRPS technique 
directly would generate the same pre-aggregated data cube. 




[Figure 2 panels: original data cube A; processing of the rows using SRPS: $A_1$; 
processing of the columns using SRPS: $A_2$] 

Fig. 2. Original data cube A, intermediate cube $A_1$, and final SRPS cube $A_2$ (fat lines 
indicate partitioning into blocks by SRPS) 



Generalizing the two-dimensional IDC construction to d dimensions is 
straightforward. First, for each dimension $\delta_i$, $1 \le i \le d$, a one-dimensional 
technique $\Theta_i$ is selected. Then $\Theta_1$ is applied along dimension $\delta_1$, i.e., to each array 
$[0, c_2, c_3, \ldots, c_d] : [n_1 - 1, c_2, c_3, \ldots, c_d]$ for any combination of $c_j$, $0 \le c_j < n_j$ 
and $j \in \{2, 3, \ldots, d\}$ (intuitively only the first dimension value varies, while the 
others are fixed). Let the resulting pre-aggregated data cube be $A_1$. Each cell 
$c = [c_1, \ldots, c_d]$ in $A_1$ now contains a linear combination of the values in the 
original array A which are in the same “row” along $\delta_1$. Formally, 

$$A_1[c_1, c_2, \ldots, c_d] = \sum_{k_1=0}^{n_1-1} \alpha_{1,c_1,k_1} A[k_1, c_2, \ldots, c_d] . \qquad (2)$$

Clearly $A_1$ does not contain more cells than A (since $\Theta_1$ does not use additional 
space) and can be computed at a cost of $n_2 \cdot n_3 \cdots n_d \cdot C_1(n_1)$, where $C_i(n_i)$ 
denotes the cost of applying technique $\Theta_i$ to an array of size $n_i$. In the next step 
technique $\Theta_2$ is similarly applied to dimension $\delta_2$, but now with $A_1$, the result 
of the previous step, as the input data cube. For all cells in the resulting cube 
$A_2$ it holds that 

$$A_2[c_1, c_2, \ldots, c_d] = \sum_{k_2=0}^{n_2-1} \alpha_{2,c_2,k_2} A_1[c_1, k_2, c_3, \ldots, c_d] \qquad (3)$$

$$= \sum_{k_2=0}^{n_2-1} \alpha_{2,c_2,k_2} \sum_{k_1=0}^{n_1-1} \alpha_{1,c_1,k_1} A[k_1, k_2, c_3, \ldots, c_d] \qquad (4)$$

$$= \sum_{k_1=0}^{n_1-1} \sum_{k_2=0}^{n_2-1} \alpha_{1,c_1,k_1} \alpha_{2,c_2,k_2} A[k_1, k_2, c_3, \ldots, c_d] . \qquad (5)$$

This process continues until all dimensions are processed. The final result, the 
pre-aggregated data cube $A_d$, contains values which are linear combinations 
of the values in the original data cube. More precisely, 

$$A_d[c_1, c_2, \ldots, c_d] = \sum_{k_1=0}^{n_1-1} \sum_{k_2=0}^{n_2-1} \cdots \sum_{k_d=0}^{n_d-1} \alpha_{1,c_1,k_1} \alpha_{2,c_2,k_2} \cdots \alpha_{d,c_d,k_d} A[k_1, k_2, \ldots, k_d] . \qquad (6)$$

The cost for processing dimension $\delta_j$ is $C_j(n_j) \cdot \prod_{i \ne j} n_i$. This results in a total 
construction cost of $\sum_{j=1}^{d} \left( \prod_{i=1}^{d} n_i \right) C_j(n_j)/n_j$, which is equal to d times the 
size of the data cube if a one-dimensional pre-aggregation technique processes 
an array of size $n_j$ at cost $n_j$. 
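
For two dimensions, the iterative construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the cube is a list of lists indexed as `cube[c1][c2]`, the helper names are hypothetical, and any function mapping a one-dimensional list to its pre-aggregated list can be passed as a technique.

```python
from itertools import accumulate

def prefix_sum(a):
    """PS technique for a one-dimensional array."""
    return list(accumulate(a))

def srps(a, block):
    """SRPS technique for a one-dimensional array (see Sect. 3.2)."""
    out, running, local = [], 0, 0
    for j, value in enumerate(a):
        running += value
        if j % block == 0:
            out.append(running)
            local = 0
        else:
            local += value
            out.append(local)
    return out

def transpose(cube):
    return [list(line) for line in zip(*cube)]

def build_idc_2d(cube, technique1, technique2):
    """Apply technique1 along the first index, then technique2 along the second."""
    # Lines along dimension 1: the first index varies, the second is fixed.
    a1 = transpose([technique1(line) for line in transpose(cube)])
    # Lines along dimension 2: the second index varies, the first is fixed.
    return [technique2(line) for line in a1]

if __name__ == "__main__":
    cube = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(build_idc_2d(cube, prefix_sum, prefix_sum))
    print(build_idc_2d(cube, lambda a: srps(a, 3), prefix_sum))
```

Since each step is a linear map acting on a different index, as equations (3) to (5) show, the order in which the dimensions are processed does not affect the result.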



3.3 Querying an Iterative Data Cube 

Aggregate range queries as issued by a user or application select ranges on the 
original data cube. This data cube, however, was replaced by an Iterative Data 
Cube where cells contain pre-computed aggregate values. The query therefore 
needs to be translated to match the different contents. We will show that the pro- 
blem of querying a high-dimensional IDC can be reduced to the one-dimensional 
cases. 

Let $\Theta$ be a one-dimensional pre-aggregation technique, and let A and $A_\Theta$ 
denote the original and pre-aggregated data cubes, respectively. Technique $\Theta$ 
has to be complete in the sense that it must be possible to answer each range 
sum query on A by using $A_\Theta$. Formally, for each range r on A there must exist 
coefficients $\beta_{r,l}$ such that 

$$\sum_{j \in r} A[j] = \sum_{l=0}^{n-1} \beta_{r,l} A_\Theta[l] \qquad (7)$$

where the $\beta_{r,l}$ are variables whose values depend on the pre-aggregation 
technique and the selected range. In the example in Fig. 1 the coefficients for SRPS 
(range r = 2 : 5) are $\beta_{r,0} = \beta_{r,1} = -1$, $\beta_{r,3} = \beta_{r,5} = 1$, and $\beta_{r,l} = 0$ for 
$l \in \{2, 4, 6, 7, 8\}$. 

On a d-dimensional data cube A a range sum query selects a range $r_i$ for 
each dimension $\delta_i$. The answer Q to this query is computed as 

$$Q = \sum_{j_d \in r_d} \sum_{j_{d-1} \in r_{d-1}} \cdots \sum_{j_1 \in r_1} A[j_1, j_2, \ldots, j_d] . \qquad (8)$$
Recall that the pre-aggregated cube $A_d$ for A was obtained by iteratively 
applying one-dimensional pre-aggregation techniques, such that data cube $A_i$ is 
computed by applying technique $\Theta_i$ along dimension $\delta_i$ to $A_{i-1}$ (let $A_0 = A$). 
Consequently, range sum Q can alternatively be computed as 

$$Q = \sum_{j_d \in r_d} \sum_{j_{d-1} \in r_{d-1}} \cdots \sum_{j_2 \in r_2} \Bigl( \sum_{l_1=0}^{n_1-1} \beta_{1,r_1,l_1} A_1[l_1, j_2, \ldots, j_d] \Bigr) \qquad (9)$$

$$= \sum_{l_1=0}^{n_1-1} \beta_{1,r_1,l_1} \Bigl( \sum_{j_d \in r_d} \sum_{j_{d-1} \in r_{d-1}} \cdots \sum_{j_2 \in r_2} A_1[l_1, j_2, \ldots, j_d] \Bigr) \qquad (10)$$

$$= \sum_{l_1=0}^{n_1-1} \beta_{1,r_1,l_1} \Bigl( \sum_{j_d \in r_d} \sum_{j_{d-1} \in r_{d-1}} \cdots \sum_{j_3 \in r_3} \sum_{l_2=0}^{n_2-1} \beta_{2,r_2,l_2} A_2[l_1, l_2, j_3, \ldots, j_d] \Bigr) \qquad (11)$$

$$= \sum_{l_1=0}^{n_1-1} \sum_{l_2=0}^{n_2-1} \cdots \sum_{l_d=0}^{n_d-1} \beta_{1,r_1,l_1} \beta_{2,r_2,l_2} \cdots \beta_{d,r_d,l_d} A_d[l_1, l_2, \ldots, l_d] . \qquad (12)$$

The $\beta_{i,r_i,l_i}$ are well defined by the aggregation technique $\Theta_i$ and the selected 
range $r_i$. There are no dependencies between the different dimensions in the sense 
that $\beta_{i,r_i,l_i}$ does not depend on the techniques $\Theta_j$ and the ranges $r_j$ if $j \ne i$. This 
enables the efficient decomposition into one-dimensional sub-problems. Note 
that cell $A_d[l_1, \ldots, l_d]$ of the pre-aggregated array $A_d$ contributes to the query 
result Q if and only if the value of $\beta_{1,r_1,l_1} \beta_{2,r_2,l_2} \cdots \beta_{d,r_d,l_d}$ is not zero. 

The query algorithm follows directly from the above discussion. For each 
dimension $\delta_i$ and range $r_i$, the set of all $l_i$ such that $\beta_{i,r_i,l_i}$ is non-zero is determined 
independently of the other dimensions. Then, for each possible combination of 
non-zero $\beta_{1,r_1,l_1}, \beta_{2,r_2,l_2}, \ldots, \beta_{d,r_d,l_d}$ the cell $A_d[l_1, l_2, \ldots, l_d]$ has to be accessed 
and contributes its value, multiplied by $\beta_{1,r_1,l_1} \beta_{2,r_2,l_2} \cdots \beta_{d,r_d,l_d}$, to the final 
result Q of the range sum query. 
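
The query algorithm can be sketched in Python for a two-dimensional SRPS cube. The function names below are hypothetical and the code is an illustration of equation (12), not the authors' implementation: `srps_coeffs` returns the non-zero coefficients $\beta_{r,l}$ of one dimension as a dictionary, and `idc_range_sum` combines the per-dimension dictionaries by taking their Cartesian product.

```python
from itertools import product

def srps(a, block):
    """One-dimensional SRPS array (anchors hold global prefix sums)."""
    out, running, local = [], 0, 0
    for j, value in enumerate(a):
        running += value
        if j % block == 0:
            out.append(running)
            local = 0
        else:
            local += value
            out.append(local)
    return out

def build_srps_2d(cube, block):
    """Apply SRPS along the first index, then along the second index."""
    along_dim1 = [srps(list(line), block) for line in zip(*cube)]
    a1 = [list(line) for line in zip(*along_dim1)]
    return [srps(line, block) for line in a1]

def srps_coeffs(lo, hi, block):
    """Non-zero coefficients beta_{r,l} for the range r = lo : hi (inclusive)."""
    coeffs = {}

    def add_prefix(x, sign):
        # The prefix sum up to x is SRPS[anchor] plus SRPS[x] if x is not an anchor.
        anchor = (x // block) * block
        coeffs[anchor] = coeffs.get(anchor, 0) + sign
        if x != anchor:
            coeffs[x] = coeffs.get(x, 0) + sign

    add_prefix(hi, +1)
    if lo > 0:
        add_prefix(lo - 1, -1)
    return {l: beta for l, beta in coeffs.items() if beta != 0}

def idc_range_sum(idc, coeffs_per_dim):
    """Sum beta_1 * beta_2 * ... * A_d[l_1, l_2, ...] over all non-zero combinations."""
    total = 0
    for combination in product(*(c.items() for c in coeffs_per_dim)):
        cell, weight = idc, 1
        for index, beta in combination:
            cell = cell[index]
            weight *= beta
        total += weight * cell
    return total

if __name__ == "__main__":
    cube = [[(7 * i + 3 * j) % 5 for j in range(9)] for i in range(9)]
    idc = build_srps_2d(cube, 3)
    coeffs = [srps_coeffs(2, 5, 3), srps_coeffs(4, 6, 3)]
    print(coeffs, idc_range_sum(idc, coeffs))
```

For the ranges 2 : 5 and 4 : 6 the two dictionaries have four and two entries, so eight cells are accessed, matching the eight terms of the worked example in the text.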

Figure 3 shows an example for a query that computes SUM(A[2,4] : A[5,6])
on a two-dimensional pre-aggregated data cube (SRPS with box size 3 applied
along both dimensions). First, for range 2 : 5 in dimension $\delta_1$ and
range 4 : 6 in dimension $\delta_2$ the indices with non-zero $\beta$ values
are obtained together with the $\beta$s. Recall that for range $r_1 = 2 : 5$
we obtained the values $\beta_{1,r_1,0} = \beta_{1,r_1,1} = -1$,
$\beta_{1,r_1,3} = \beta_{1,r_1,5} = 1$, and $\beta_{1,r_1,l_1} = 0$ for
$l_1 \in \{2, 4, 6, 7, 8\}$ (see above). Similarly, we obtain
$\beta_{2,r_2,3} = -1$, $\beta_{2,r_2,6} = 1$, and $\beta_{2,r_2,l_2} = 0$ for
$l_2 \in \{0, 1, 2, 4, 5, 7, 8\}$ for range $r_2 = 4 : 6$ in dimension
$\delta_2$. Combining the results leads to the correct computation of
SUM(A[2,4] : A[5,6]) as $A_2[0,3] - A_2[0,6] + A_2[1,3] - A_2[1,6] -
A_2[3,3] + A_2[3,6] - A_2[5,3] + A_2[5,6]$.
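
The example above can be reproduced with a short sketch. It is not the authors' implementation: the function names (`srps_transform`, `srps_beta`, `idc_build_2d`, `idc_range_sum`) and the test cube are hypothetical, and the SRPS layout is assumed from the coefficients quoted in the text (an anchor stores the prefix sum up to and including itself, a non-anchor cell stores the sum of the cells after its anchor up to itself).

```python
from itertools import product

def srps_transform(vec, s):
    """Apply one-dimensional SRPS with box size s (assumed layout, see lead-in)."""
    out = []
    for j in range(len(vec)):
        anchor = (j // s) * s
        if j == anchor:
            out.append(sum(vec[: j + 1]))             # anchor: prefix sum A[0..j]
        else:
            out.append(sum(vec[anchor + 1 : j + 1]))  # non-anchor: sum after the anchor
    return out

def srps_beta(lo, hi, s):
    """Non-zero coefficients beta[l] with SUM(A[lo]:A[hi]) = sum(beta[l] * SRPS[l])."""
    beta = {}
    def add_prefix(j, sign):
        anchor = (j // s) * s
        for l in {anchor, j}:                         # prefix(j) = SRPS[anchor] (+ SRPS[j])
            beta[l] = beta.get(l, 0) + sign
    add_prefix(hi, +1)
    if lo > 0:
        add_prefix(lo - 1, -1)
    return {l: b for l, b in beta.items() if b != 0}

def idc_build_2d(cube, s):
    """Iteratively pre-aggregate a 2-D cube: SRPS along one dimension, then the other."""
    rows = [srps_transform(row, s) for row in cube]
    cols = [srps_transform(list(col), s) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def idc_range_sum(pre, r1, r2, s):
    """Combine the per-dimension coefficients; returns (sum, number of accessed cells)."""
    b1 = srps_beta(r1[0], r1[1], s)
    b2 = srps_beta(r2[0], r2[1], s)
    cells = [(l1, l2, c1 * c2)
             for (l1, c1), (l2, c2) in product(b1.items(), b2.items())]
    return sum(c * pre[l1][l2] for l1, l2, c in cells), len(cells)

# Hypothetical 9 x 9 cube; the values are arbitrary.
cube = [[(7 * i + 3 * j) % 5 + 1 for j in range(9)] for i in range(9)]
pre = idc_build_2d(cube, 3)
total, accessed = idc_range_sum(pre, (2, 5), (4, 6), 3)
```

For the ranges 2 : 5 and 4 : 6 the sketch touches eight cells, matching the eight terms in the expression above.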

The query cost of IDC, i.e., the number of cells accessed in $A_d$, follows
directly from the algorithm. It is the product of the sizes of the sets of
non-zero $\beta$ values obtained for each dimension. As a consequence, once the
worst case or average query cost of a one-dimensional technique is known, it is
easy to compute the worst/average query cost for the d-dimensional
pre-aggregated data cube by multiplying the one-dimensional costs. In our
example, one-dimensional SRPS allows each range sum to be computed from at most
4 values (computing SUM(A[e] : A[f]) with SRPS requires at most accessing the
anchor of the box that contains f, cell f, and, if e > 0, the anchor of the box
that contains e − 1 and cell e − 1). Consequently, independent of the selected
ranges at most $4^d$ cells in the d-dimensional pre-aggregated SRPS data cube
have to be accessed.

[Figure 3. Query: SUM(A[2,4] : A[5,6]), showing the range in the first
dimension and the values that are subtracted; query result computation:
93 + 61 − 51 − 35 − 25 − 24 + 15 + 14 = 48. Update: A[4,2] decreased from 3 to
1; cells to be updated: [4,2], [4,3], [4,6], [5,2], [5,3], [5,6], [6,2], [6,3],
[6,6].]

Fig. 3. Processing queries and updates on an Iterative Data Cube (SRPS technique
used for both dimensions)

3.4 Updating an Iterative Data Cube 

For the original data cube, an update to the data set only affects a single cell. 
Since Iterative Data Cubes store pre-computed aggregates, such an update has 
to be translated to updates on a set of cells in the pre-aggregated data cube. 
The set of affected cells in $A_d$ follows directly from (6). Note that equations (6)
and (12) are very similar, therefore the algorithms for processing queries and 
updates are almost identical. 

Equation (1) describes the dependencies between the pre-aggregated and the
original data cube for the one-dimensional case. Clearly $A_\Theta[j]$ is
affected by an update to $A[k]$ if and only if $\alpha_{j,k} \neq 0$ (see
Fig. 1 for an example). Based on (6) this can be generalized to d dimensions.
Let $[k_1, \ldots, k_d]$ be the cell in the original data cube $A$ which is
updated by a value $\Delta$. For each dimension $\delta_i$, the set of all
$c_i$ such that $\alpha_{i,c_i,k_i}$ is non-zero is determined independently of
the other dimensions. Then, for each possible combination of non-zero
$\alpha_{1,c_1,k_1}, \alpha_{2,c_2,k_2}, \ldots, \alpha_{d,c_d,k_d}$, the cell
$A_d[c_1, c_2, \ldots, c_d]$ has to be updated by
$\Delta \cdot \alpha_{1,c_1,k_1} \alpha_{2,c_2,k_2} \cdots \alpha_{d,c_d,k_d}$.

Figure 3 shows an example for an update that decreases the value A[4,2] in
the original data cube by 2. Recall that SRPS with block size s = 3 was applied
along both dimensions of the data cube and that the corresponding coefficients
are $\alpha_{1,0,0} = \alpha_{1,2,1} = \alpha_{1,2,2} = 1$,
$\alpha_{1,3,k_1} = 1$ for $0 \le k_1 \le 3$, $\alpha_{1,4,4} = 1$,
$\alpha_{1,5,4} = \alpha_{1,5,5} = 1$, $\alpha_{1,6,k_1} = 1$ for
$0 \le k_1 \le 6$, and $\alpha_{1,c_1,k_1} = 0$ for all other combinations of
$c_1$ and $k_1$. In dimension $\delta_1$ the updated cell has the index value
4, i.e., the relevant coefficients are $\alpha_{1,4,4}$, $\alpha_{1,5,4}$, and
$\alpha_{1,6,4}$, which have the value 1, while all other $\alpha_{1,c_1,4}$
are zero. Similarly the non-zero coefficients
$\alpha_{2,2,2} = \alpha_{2,3,2} = \alpha_{2,6,2} = 1$ are obtained.
Consequently, the cells [4,2], [4,3], [4,6], [5,2], [5,3], [5,6], [6,2], [6,3],
and [6,6] in $A_2$ have to be updated by $1 \cdot (-2)$.
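
The update in this example can be sketched as follows. The helper names (`srps_affected`, `idc_update_2d`) are hypothetical, and the set of affected cells is derived from the assumed SRPS layout (anchors hold inclusive prefix sums, non-anchor cells hold the sum of the cells after their anchor); every non-zero coefficient is 1 under that assumption, so each affected cell changes by the same amount.

```python
from itertools import product

def srps_affected(k, n, s):
    """Indices c of a 1-D SRPS array (length n, box size s) that depend on A[k]."""
    cells = {c for c in range(k, n) if c % s == 0}   # every anchor at or after k
    if k % s != 0:                                   # k is not an anchor itself
        box_end = min((k // s + 1) * s, n)
        cells |= set(range(k, box_end))              # cells k.. to the end of k's box
    return sorted(cells)

def idc_update_2d(pre, k1, k2, delta, s):
    """Apply an update of `delta` at original cell [k1, k2]; returns touched cells."""
    n1, n2 = len(pre), len(pre[0])
    touched = []
    for c1, c2 in product(srps_affected(k1, n1, s), srps_affected(k2, n2, s)):
        pre[c1][c2] += delta                         # all non-zero alphas equal 1 here
        touched.append((c1, c2))
    return touched

# A 9 x 9 pre-aggregated cube of zeros is enough to see which cells are touched.
pre = [[0] * 9 for _ in range(9)]
touched = idc_update_2d(pre, 4, 2, -2, 3)
```

The number of touched cells is the product of the per-dimension set sizes, here 3 · 3 = 9.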

The update cost of IDC, i.e., the number of accessed cells in $A_d$, is the
product of the sizes of the sets of non-zero $\alpha$ values obtained for each
dimension. Thus, as for the query cost, once the worst case or average update
cost of a one-dimensional technique is known, it is easy to compute the
worst/average update cost for high-dimensional Iterative Data Cubes. This is
done by multiplying the worst/average update costs of the one-dimensional
techniques.

4 IDC for Real-World Applications 

We present one-dimensional aggregation techniques and discuss how they are 
selected for pre-aggregating a high-dimensional data cube. The presented tech- 
niques mainly illustrate the range of possible tradeoffs between query and update 
cost. In the following discussion the original array is denoted by A and has
n elements A[0], A[1], ..., A[n − 1]. The pre-aggregated array will be named like
the corresponding generating technique. 

4.1 One-Dimensional Pre-aggregation Techniques 

The pre-aggregated array used for the PS technique [11] contains the prefix
sums of the original array, i.e., $PS[j] = \sum_{k=0}^{j} A[k]$. Figure 1 shows
an example for n = 9. Any range sum on A can be computed by accessing at most
two values in PS (difference between value at endpoint and predecessor of
anchor of query range). On the other hand, an update to A[k] affects all PS[j]
where $j \ge k$. This results in worst case costs of 2 for a query and of n for
an update. In Fig. 1 cells in PS which have to be accessed in order to answer
SUM(A[2] : A[5]) and those that are affected by an update to A[4] are shaded.
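
A minimal sketch of the PS technique, with hypothetical helper names and an arbitrary array of n = 9 values, makes the two cost bounds concrete:

```python
from itertools import accumulate

def ps_build(a):
    """PS[j] = A[0] + ... + A[j]."""
    return list(accumulate(a))

def ps_range_sum(ps, lo, hi):
    """SUM(A[lo] : A[hi]) from at most two cells of PS."""
    return ps[hi] - (ps[lo - 1] if lo > 0 else 0)

def ps_update(ps, k, delta):
    """Add delta to A[k]; every PS[j] with j >= k changes. Returns cells touched."""
    for j in range(k, len(ps)):
        ps[j] += delta
    return len(ps) - k

a = [3, 1, 4, 1, 5, 9, 2, 6, 5]          # arbitrary example values
ps = ps_build(a)
range_sum = ps_range_sum(ps, 2, 5)       # reads PS[5] and PS[1]
worst_case_update = ps_update(ps, 0, 1)  # updating A[0] touches all n cells
```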

The SRPS technique [17] (Fig. 1) was already introduced in Sect. 3.2. Its
worst case costs are 4 for queries, and $2\sqrt{n}$ (or $2\sqrt{n} - 2$ when n
is a perfect square) for updates [17].

To compute the pre-aggregated array SDDC, the SDDC technique [17] first
partitions the array A into two blocks of equal size. The anchor cell of each
block stores the corresponding prefix sum of A. For each block, the same
technique is applied recursively to the sub-arrays of non-anchor cells. The
recursive partitioning defines a hierarchy, more precisely a tree of height
less than or equal to $\lceil \log_2 n \rceil$, on the partitions (blocks).
Queries and updates conceptually descend this tree. The processing starts at
the root and continues to that block that contains the endpoint of the query or
the updated cell, respectively. A query SUM(A[0] : A[c]) is answered by adding
the values of the anchors of those blocks that contain c. Due to the
construction, at most one block per level can contain c, resulting in a worst
case prefix sum query cost of $\lceil \log_2 n \rceil$. Queries with ranges
A[x] : A[y] where x > 0 are answered as SUM(A[0] : A[y]) − SUM(A[0] : A[x − 1]).
Thus the cost of answering any range sum query is bounded by
$2 \lceil \log_2 n \rceil$. At each level an update to a cell u in a block U
only propagates to those cells that have a greater or equal index than u and
are an anchor of a block that has the same parent as U. Consequently, the
update cost is bounded by the height of the tree ($\lceil \log_2 n \rceil$).
Fig. 4 shows an example of an SDDC array and how the query SUM(A[2] : A[5]) and
an update to A[4] are processed. Note that SDDC can be generalized by choosing
different numbers of blocks and different block sizes when partitioning the
data cube. This enables the technique to take varying attribute hierarchies
into account.



[Figure 4: three panels titled "Original array A", "SDDC array", and "LPS
array". The query SUM(A[2] : A[5]) is computed as 1 + 2 + 2 + 4 = 9 on the
original array, as 17 − (3 + 5) = 9 on the SDDC array, and as 8 + (9 − 8) = 9
on the LPS array. Each panel also shows the update A[4] = 3.]

Fig. 4. Original array A and corresponding SDDC and LPS ($s_1 = 3$, $s_2 = 4$,
$s_3 = 3$) arrays (query range and updated cell are framed, accessed cells are
shaded)



The Local Prefix Sum (LPS) technique partitions array A into t blocks of sizes
$s_1, s_2, \ldots, s_t$, respectively. Any cell in the pre-aggregated array LPS
contains a “local” prefix sum, i.e., the sum of its value and the values in its
left neighbors until the anchor of the block it is contained in. A range query
is answered by adding the values of all block endpoints that are contained in
the query range, adding to it the value of the cell at the endpoint of the
query range (if it is not an endpoint of a block) and subtracting the value of
the cell left of the anchor of the query range (if it is not an endpoint of a
block). Thus the query cost is bounded by t + 1. Figure 4 shows an example.
Updates only affect cells with a greater or equal index than the updated cell
in the same block, resulting in a worst case update cost of
$\max\{s_1, \ldots, s_t\}$. For a certain t the query cost is fixed, but the
worst case update cost is minimized by choosing $s_1 = s_2 = \ldots = s_t$. The
corresponding family of (query cost, update cost) tradeoffs therefore becomes
$(t + 1, \lceil n/t \rceil)$.
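
The LPS query and update rules described above can be sketched as follows. The function names and the example array are hypothetical; the block sizes (3, 4, 3) are the ones used in Fig. 4.

```python
def lps_build(a, sizes):
    """Each cell holds the sum of A from the start of its block up to itself."""
    out, start = [], 0
    for s in sizes:
        running = 0
        for v in a[start:start + s]:
            running += v
            out.append(running)
        start += s
    return out

def lps_range_sum(lps, sizes, lo, hi):
    """SUM(A[lo] : A[hi]); returns (sum, number of accessed cells)."""
    ends, pos = set(), -1
    for s in sizes:
        pos += s
        ends.add(pos)                                  # last index of each block
    cells = [(e, +1) for e in sorted(ends) if lo <= e <= hi]
    if hi not in ends:
        cells.append((hi, +1))                         # partial block at the right end
    if lo > 0 and (lo - 1) not in ends:
        cells.append((lo - 1, -1))                     # remove the part left of lo
    return sum(sign * lps[i] for i, sign in cells), len(cells)

def lps_update(lps, sizes, k, delta):
    """Add delta to A[k]; only cells j >= k in the same block change."""
    start = 0
    for s in sizes:
        if k < start + s:
            for j in range(k, start + s):
                lps[j] += delta
            return start + s - k                       # number of cells touched
        start += s
    raise IndexError(k)

a = list(range(1, 11))                                 # arbitrary example values
sizes = [3, 4, 3]
lps = lps_build(a, sizes)
total, accessed = lps_range_sum(lps, sizes, 2, 5)
```

With t = 3 blocks no query in this sketch reads more than t + 1 = 4 cells, and no update touches more than max(sizes) = 4 cells.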

The two techniques of using A directly or using its prefix sum array PS
instead constitute the extreme cases of minimal cost of updates and minimal
cost of queries for one-dimensional data. Note that it is possible to reduce
the worst case query cost to 1. This, however, requires pre-computing and
storing the result for any possible range query, i.e., $\frac{n(n+1)}{2}$
values. Also, since $A[\lfloor n/2 \rfloor]$ is contained in
$(\lfloor n/2 \rfloor + 1)(n - \lfloor n/2 \rfloor)$ different ranges, the
update cost for this scheme is at least $n^2/4$. Since we focus on techniques
that do not introduce space overhead, PS is the approach with the minimal query
cost. Table 1 summarizes the query and update costs for selected
one-dimensional techniques.



Table 1. Query-update cost tradeoffs for selected one-dimensional techniques

One-dimensional technique                    Query cost                  Update cost               Note
                                             (worst case)                (worst case)
Original array                               $n$                         $1$
Prefix Sum (PS)                              $2$                         $n$
Space-Efficient Relative Prefix Sum (SRPS)   $4$                         $2\sqrt{n} - 2$           when n perfect square
                                             $4$                         $2\sqrt{n}$               otherwise
Space-Efficient Dynamic Data Cube (SDDC)     $2\lceil\log_2 n\rceil$     $\lceil\log_2 n\rceil$
Local Prefix Sum (LPS)                       $t + 1$                     $\lceil n/t \rceil$       $2 \le t < n$



4.2 Selecting an IDC for an Application 

The IDC technique provides a modular framework for choosing a suitable pre- 
aggregation scheme. It greatly simplifies taking advantage of a priori knowledge 
about an application. For instance, when it is known that a hierarchy exists for 
an attribute and that users typically query according to this hierarchy (e.g., it is 
more likely that a query aggregates monthly sales figures than sales figures for 
a 30-day period that starts in the middle of a month), one can set a correspon- 
ding block size for SDDC or SRPS. If a dimension has only a few values (e.g., 
gender), the best choice in most cases will be PS or not pre-aggregating along 
this dimension at all. Alternatively, if no appropriate technique is available, it 
is relatively easy to develop a new one and to integrate it into the framework. 
Recall, that all one has to do is to develop a one-dimensional technique and to 
analyze its cost tradeoffs. 

The process of selecting an appropriate IDC is illustrated with a hypothe- 
tical example. Assume that the data cube has three dimensions of size n each, 
and a fourth dimension of size 2 (e.g., gender). Two of the three attributes with 
dimension size n are hierarchical and it is likely that users query according to 
the hierarchies. For simplicity assume further that both hierarchies are similar 
to a balanced binary tree. Apart from that, the query cost has to be small, but 
frequent updates are expected. Then the best choice for the two hierarchical 
attributes is the SDDC technique (depending on the actual hierarchical struc- 
ture, variations of SDDC can be used) . It guarantees a sublinear query and up- 
date cost and provides good expected costs for queries that aggregate according 
to the hierarchies. For the dimension of size 2 pre-aggregation is unnecessary. 
The remaining dimension is processed with PS to enable fast queries. In total,
the worst case costs are
$2 \log_2 n \cdot 2 \log_2 n \cdot 2 \cdot 2 = 16 \log_2^2 n$ for queries and
$\log_2 n \cdot \log_2 n \cdot 1 \cdot n = n \log_2^2 n$ for updates. Note that
all costs are exact, i.e., there are no hidden constants.
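
The cost calculation for this example is a plain product of per-dimension costs. In the sketch below the helper names and the value n = 1024 are chosen purely for illustration; the per-technique worst-case costs are those of Table 1.

```python
from math import ceil, log2, prod

def sddc_costs(n):
    height = ceil(log2(n))
    return (2 * height, height)      # (query cost, update cost)

def ps_costs(n):
    return (2, n)

def original_costs(n):
    return (n, 1)                    # no pre-aggregation along this dimension

def idc_costs(per_dimension):
    """Worst-case (query, update) costs of an IDC: products over all dimensions."""
    return (prod(q for q, _ in per_dimension),
            prod(u for _, u in per_dimension))

n = 1024                             # illustrative dimension size, log2(n) = 10
query_cost, update_cost = idc_costs(
    [sddc_costs(n), sddc_costs(n), original_costs(2), ps_costs(n)]
)
```

For n = 1024 this gives 16 · 10² = 1600 accessed cells per query and 1024 · 10² = 102400 per update.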



5 Comparing IDC to Previous Approaches 

The IDC technique reduces the problem of pre-aggregating d-dimensional data 
cubes to the one-dimensional case. Compared to techniques that directly solve 
the d-dimensional problem, IDC’s range of possible query-update cost tradeoffs 
is therefore restricted. However, as we will show below, none of the previously 
proposed d-dimensional pre-aggregation techniques obtains superior tradeoffs. 

The PS, SRPS, and SDDC techniques constitute special cases of IDC. One
can iteratively create the pre-aggregated data cubes for these techniques by
applying the corresponding one-dimensional technique for each dimension of the
original data cube. This results in d-dimensional Iterative Data Cubes with
worst case (query, update)-cost pairs $(2^d, n^d)$, $(4^d, 2^d n^{d/2})$, and
$(2^d \lceil \log_2 n \rceil^d, \lceil \log_2 n \rceil^d)$, respectively. As an
interesting by-product PS, SRPS, and SDDC can be analyzed and implemented as
Iterative Data Cubes. Note that for SRPS and SDDC the implementation is quite
complex and the analysis difficult. For instance, for a d-dimensional data
cube, SDDC stores (d − 1)-dimensional surfaces of pre-aggregated cumulative
values recursively as (d − 1)-dimensional data cubes. Thus, IDC provides a
great “tool” for verifying the results of these previous approaches and for
obtaining new results, for instance average case costs.

The HC technique [3] generates a pre-aggregated data cube by hierarchically
partitioning the original data cube A into smaller hyper-rectangles (blocks) of
equal size. The number of recursive partitioning steps determines the height of
a Hierarchical Cube. Hierarchical Rectangle Cubes (HRC) with a height of one
are identical to the original data cube. In HRCs of height two each cell stores
the prefix sum local to the anchor of the block it belongs to. Consequently,
any HRC of height two can be constructed iteratively by applying the
one-dimensional LPS technique with the corresponding block sizes along each
dimension. Hierarchical Rectangle Cubes of height one and two hence are
generalized by IDC. For HRCs of height greater than two, [3] does not provide
analytical or experimental results. Thus we were not able to compare IDC to
HRCs of height greater than two. Hierarchical Band Cubes (HBC) cannot be
generalized by our technique. Only HBCs of height one are identical to the PS
cube, which is an IDC. For HBCs of height greater than one we prove that, no
matter which hierarchical partitioning scheme is used, a d-dimensional HBC
always has a worst case update cost of at least $n^{d-1}$ (we assume without
loss of generality that all dimensions have a domain of size n). The proof can
be found in [15]. The range of possible update costs therefore is restricted
compared to IDC. In total, the best possible HBC cube of height $h \ge 2$ has a
worst case query cost of at least $2^d h = 2^{d+1}$ [3], and a worst case
update cost of at least $n^{d-1}$. An Iterative Data Cube where the PS
technique is used for (d − 2) dimensions and the SRPS technique is used for the
remaining two dimensions has respective worst case query and update costs of
$2^{d-2} \cdot 4^2 = 2^{d+2}$ and $n^{d-2} (2\sqrt{n})^2 = 4 n^{d-1}$. Thus,
there exists an IDC whose query and update costs are asymptotically identical
to the lower bounds for the corresponding costs of any HBC cube.

6 Conclusion 

IDC is the first pre-aggregation technique on data cubes that can take the spe- 
cific properties of different dimension attributes into account. Instead of solving 
a d-dimensional pre-aggregation problem directly, the different dimensions are 
handled independently. This greatly simplifies the development, analysis, and 
implementation compared to earlier approaches. At the same time a greater 
variety of query-update cost tradeoffs can be generated. Thus Iterative Data 
Cubes provide a practical framework for developing pre-aggregation techniques 
for MOLAP data cubes. 

Even though the space of possible pre-aggregation schemes is restricted by the 
iterative combination process, we were able to show that the query-update cost 
tradeoffs of previously proposed techniques are matched. It remains, however, 
as an open problem, to show that in general the query-update cost tradeoffs 
that are optimal for the IDC technique are also optimal with respect to any 
pre-aggregation technique on a high-dimensional data cube. We will pursue this 
problem, as well as the problem of sparse data sets [16] in our future research. 
Abstract. Many data mining approaches focus on the discovery of similar
(and frequent) data values in large data sets. We present an alternative,
but complementary approach in which we search for empty
regions in the data. We consider the problem of finding all maximal
empty rectangles in large, two-dimensional data sets. We introduce a
novel, scalable algorithm for finding all such rectangles. The algorithm
achieves this with a single scan over a sorted data set and requires only a 
small bounded amount of memory. We also describe an algorithm to find 
all maximal empty hyper-rectangles in a multi-dimensional space. We 
consider the complexity of this search problem and present new bounds 
on the number of maximal empty hyper-rectangles. We briefly overview 
experimental results obtained by applying our algorithm to a synthetic 
data set. 



1 Introduction 

Much work in data mining has focused on characterizing the similarity of data 
values in large data sets. This work includes clustering or classification in which 
different techniques are used to group and characterize the data. Such techniques 
permit the development of more “parsimonious” versions of the data. Parsimony 
may be measured by the degree of compression (size reduction) between the 
original data and its mined characterization [3] . Parsimony may also be measured 
by the semantic value of the characterization in revealing hidden patterns and 
trends in the data [8,1]. 

Consider the data of Figure 1 representing information about traffic infrac- 
tions (tickets), vehicle registrations, and drivers. Using association rules, one 
may discover that Officer Seth gave out mostly speeding tickets [1] or that dri- 
vers of BMWs usually get speeding tickets over $100 [11]. Using clustering one 
may discover that many expensive (over $500) speeding tickets were given out 
to drivers of BMW’s [14]. Using fascicles, one may discover that officers Seth, 
Murray and Jones gave out tickets for similar amounts on similar days [8]. 

The data patterns discovered by these techniques are defined by some mea- 
sure of similarity (data values must be identical or similar to appear together in 
a pattern) and some measure of degree of frequency or occurrence (a pattern is 
only interesting if a sufficient number of data values manifest the pattern or, in 
the case of outlier detection, if very few values manifest the pattern). 

In this paper, we propose an alternative, but complementary approach to 
characterizing data. Specifically, we focus on finding and characterizing empty 
regions in the data. In the above data set, we would like to discover if there
are certain ranges of the attributes that never appear together. For example, 
it may be the case that no tickets were issued to BMW Z3 series cars before 
1997 or that no tickets for over $1,000 were issued before 1990 or that there is 
no record of any tickets issued after 1990 for drivers born before 1920. Some of 
these empty regions may be foreseeable (perhaps BMW Z3 series cars were first 
produced in 1997). Others may have more complex or uncertain causes (perhaps 
older drivers tend to drive less and more defensively). 

Clearly, knowledge of empty regions may be valuable in and of itself as it 
may reveal unknown correlations between data values which can be exploited in 
applications.¹ For example, if a DBA determines that a certain empty region is
a time invariant constraint, then it may be modeled as an integrity constraint. 
Knowing that no tickets for over $1,000 were issued before 1990, a DBA of a re- 
lational DBMS can add a check constraint to the Tickets table. Such constraints 
have been exploited in semantic query optimization [5]. 

To maximize the use of empty space knowledge, our goal in this work is to not 
only find empty regions in the data, but to fully characterize that empty space. 
Specifically, we discover the set of all maximal empty rectangles. In Section 2, 
we formally introduce this problem and place our work in the context of related 
work from the computational geometry and artificial intelligence communities. 
In Section 3, we present an algorithm for finding the set of all maximal empty 
rectangles in a two-dimensional data set. Unlike previous work in this area, we 
focus on providing an algorithm that scales well to large data sets. Our algorithm 
requires a single scan of a sorted data set and uses a small, bounded amount 
of memory to compute the set of all maximal empty rectangles. In contrast, 
related algorithms require space that is at least on the order of the size of the 
data set. We describe also an algorithm that works in multiple dimensions and 
present complexity results along with bounds on the number of maximal hyper- 
rectangles. In Section 4, we present the results of experiments performed on 
synthetic data showing the scalability of our mining algorithm. We conclude in 
Section 5. 



¹ [10] describe applications of such correlations in a medical domain.
2 Problem Definition and Related Work

Consider a data set D consisting of a set of tuples $(v_x, v_y)$ over two
totally ordered domains. Let X and Y denote the set of distinct values in the
data set in each of the dimensions. We can depict the data set as an
$|X| \times |Y|$ matrix M of 0's and 1's. There is a 1 in position $(x, y)$ of
the matrix if and only if $(v_x, v_y) \in D$ where $v_x$ is the $x$-th smallest
value in X and $v_y$ the $y$-th smallest in Y.

An empty rectangle is maximal² if it cannot be extended along either the
X or Y axis because there is at least one 1-entry on each of the borders of the
rectangle. Although it appears that there may be a huge number of overlapping
maximal rectangles, [12] proves that the number is at most $O(|D|^2)$ and that
for a random placement of the 1-entries the expected value is
$O(|D| \log |D|)$ [12]. We prove that the number is at most the number of
0-entries, which is $O(|X||Y|)$ (Theorem 4).
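
To make the definition concrete, the following brute-force sketch enumerates all maximal empty rectangles of a small 0/1 matrix. It is only an illustration of the definition, not the scalable algorithm of this paper; the function name, the (top, left, bottom, right) encoding, and the choice to treat the matrix boundary as blocking are assumptions.

```python
def maximal_empty_rectangles(matrix):
    """Return all maximal all-zero rectangles as (top, left, bottom, right)."""
    rows, cols = len(matrix), len(matrix[0])

    def empty(t, l, b, r):
        return all(matrix[y][x] == 0
                   for y in range(t, b + 1) for x in range(l, r + 1))

    result = []
    for t in range(rows):
        for b in range(t, rows):
            for l in range(cols):
                for r in range(l, cols):
                    if not empty(t, l, b, r):
                        continue
                    # A side is blocked by the matrix boundary (assumption)
                    # or by a 1-entry in the adjacent row or column strip.
                    up = t == 0 or not empty(t - 1, l, t - 1, r)
                    down = b == rows - 1 or not empty(b + 1, l, b + 1, r)
                    left = l == 0 or not empty(t, l - 1, b, l - 1)
                    right = r == cols - 1 or not empty(t, r + 1, b, r + 1)
                    if up and down and left and right:
                        result.append((t, l, b, r))
    return result

M = [[0, 0, 1],
     [0, 0, 0],
     [1, 0, 0]]
rectangles = maximal_empty_rectangles(M)
zero_entries = sum(row.count(0) for row in M)
```

In this small example the four maximal rectangles overlap heavily, and their number stays below the number of 0-entries.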

A related problem attempts to find the minimum number of rectangles (either 
overlapping or not) that covers all the O’s in the matrix. This problem is a 
special case of the problem known as Rectilinear Picture Compression and is 
NP-complete [7]. Hence, it is impractical for use in large data sets. 

The problem of finding empty rectangles or hyper-rectangles has been studied
in both the machine learning [10] and computational geometry literature
[12,2,4,13]. Liu et al. motivate the use of empty space knowledge for
discovering constraints (in their terms, impossible combinations of values)
[10]. However, the proposed algorithm is memory-based and not optimized for
large datasets. As the data is scanned, a data structure is kept storing all
maximal hyper-rectangles. The algorithm runs in
$O(|D|^{2(d-1)} d^3 (\log |D|)^2)$ where d is the number of dimensions in the
data set. Even in two dimensions (d = 2) this algorithm is impractical for
large datasets. In an attempt to address both the time and space complexity,
the authors propose only maintaining maximal empty hyper-rectangles that exceed
an a priori set minimum size. This heuristic is only effective if this minimum
size is set sufficiently small. Furthermore, as our experiments on real
datasets will show, for a given size, there are typically many maximal empty
rectangles that are largely overlapping. Hence, this heuristic may yield a set
of large, but almost identical rectangles. This reduces the effectiveness of
the algorithm for a large class of data mining applications where the number of
discovered regions is less important than the distinctiveness of the regions.
Other heuristic approaches have been proposed that use decision tree
classifiers to (approximately) separate occupied from unoccupied space and then
post-process the discovered regions to determine maximal empty rectangles [9].
Unlike our approach, these heuristics do not guarantee that all maximal empty
rectangles are found.

This problem has also been studied in the computational geometry literature
[12,2,4,13] where the primary goal has been to produce run time bounds. These
algorithms find all maximal empty rectangles in time $O(|D| \log |D| + s)$ and
space $O(|D|)$, where $|D|$ is the size of the data set and s denotes the
number of maximal empty rectangles. Such algorithms are particularly effective
if the data set is very sparse and there happen to be only a few maximal
rectangles. However, these algorithms do not scale well for large data sets
because of their space requirements. The algorithms must continually access and
modify a data structure that is as large as the data set itself. Because in
practice this will not fit in memory, an infeasible amount of disk access is
required on large data sets.

² Do not confuse maximal with maximum (largest).

The setting of the algorithm in [13] is different because it considers points 
in the real plane instead of 1-entries in a matrix. The only difference that this 
amounts to is that they assume that points have distinct X and Y coordinates. 
This is potentially a problem for a database application since it would not allow 
any duplicate values in data (along any dimension). 

Despite the extensive literature on this problem, none of the known algo- 
rithms are effective for large data sets. Even for two-dimensional data sets, the 
only known technique for scaling these algorithms is to provide a fixed bound on 
the size of the empty rectangles discovered, a technique which severely limits
the application of the discovered results. 

Our first contribution to this problem is an algorithm for finding all maximal 
empty rectangles in a two-dimensional space that can perform efficiently in a 
bounded amount of memory and is scalable to a large, non-memory resident data 
set. Unlike the algorithm of [10], our algorithm requires the data be processed 
in sorted order. However, sorting is a highly optimized operation within modern 
DBMS and by taking advantage of existing scalable sorting techniques, we have 
produced an algorithm with running time $O(|X||Y|)$ (i.e. linear in the input
size) that requires only a single scan over the sorted data. Furthermore, the
memory requirements are $O(|X|)$, which is an order of magnitude smaller than
the size $O(|X||Y|)$ of both the input and the output. (We assume without loss
of generality that $|X| \le |Y|$.) If the memory available is not sufficient, our
algorithm could be modified to run on a portion of the matrix at a time at 
the cost of extra scans of the data set. Our second main contribution is an 
extension of our algorithm to find all maximal empty hyper-rectangles in multi- 
dimensional data. The space and time trade-offs compare favorably to those of
the heuristic algorithm of [10] (the time complexity of our extended algorithm
is $O(d|D|^{2(d-1)})$ and the space requirements are $O(d^2|D|^{d-1})$), but
are worse than those of incomplete classifier-based algorithms [9].

3 Algorithm for Finding All Maximal Empty Rectangles 

This section presents an elegant algorithm for finding all maximal empty regions 
within a two dimensional data set. Although the binary matrix M representation 
of the data set D is never actually constructed, for simplicity we describe the 
algorithm completely in terms of M. In doing so, however, we must ensure that
only one pass is made through the data set D. 

The main structure of the algorithm is to consider each 0-entry $(x, y)$ of M
one at a time row by row. Although the 0-entries are not explicitly stored,
this is simulated as follows. We assume that the set X of distinct values in
the (smaller) dimension is small enough to store in memory. The data set D is
stored on disk sorted with respect to Y, X. Tuples from D will be read
sequentially off the disk in this sorted order. When the next tuple
$(v_x, v_y) \in D$ is read from disk, we will be able to deduce the block of
0-entries in the row before this 1-entry.
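
The simulation of the matrix from the sorted tuple stream can be sketched as a generator that holds one row at a time. The sketch assumes, for illustration only, that the tuples have already been rank-coded as 0-based (x, y) index pairs and arrive sorted by (y, x); the name `stream_rows` is hypothetical.

```python
def stream_rows(sorted_tuples, width, height):
    """Yield the rows of M one by one from tuples sorted by (y, x).

    Only the current row (width entries) is kept in memory; the tuple
    stream is consumed in a single sequential pass.
    """
    it = iter(sorted_tuples)
    pending = next(it, None)
    for y in range(height):
        row = [0] * width                 # entries are 0 unless a tuple says otherwise
        while pending is not None and pending[1] == y:
            row[pending[0]] = 1
            pending = next(it, None)
        yield row

# Tuples (x, y) sorted by y, then x: a 1-entry at (2, 0) and one at (0, 2).
rows = list(stream_rows([(2, 0), (0, 2)], width=3, height=3))
```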

When considering the 0-entry $(x, y)$, the algorithm needs to look ahead by
querying the matrix entries $(x+1, y)$ and $(x, y+1)$. This is handled by
having the single pass through the data set actually occur one row in advance.
This extra row of the matrix is small enough to be stored in memory. Similarly,
when considering the 0-entry $(x, y)$, the algorithm will have to look back and
query information about the parts of the matrix already read. To avoid
re-reading the data set, all such information is retained in memory. This
consists of staircase$(x-1, y)$, which is a stack of at most n tuples
$(x_i, y_i)$, a row of indexes $y_r(x', y-1)$ for $x' \in [1..n]$, and a single
index $x_*(x-1, y)$.

The main data structure maintained by the algorithm is the maximal staircase,
staircase$(x, y)$, which stores the shape of the maximal staircase shaped
block of 0-entries starting at entry $(x, y)$ and extending up and to the left
as far as possible. See Figure 2. Note that the bottom-right entry separating
two steps of the staircase is a 1-entry. This entry prevents the two adjoining
steps from extending up or to the left and prevents another step forming
between them.



Fig. 2. The maximal staircase for (x,y).

loop y = 1 ... n
    loop x = 1 ... m
        (I)  Construct staircase(x,y)
        (II) Output all maximal 0-rectangles with <x,y>
             as the bottom-right corner

Fig. 3. Algorithm Structure.



The purpose of constructing the staircase$(x, y)$ is to output all maximal
rectangles that lie entirely within that staircase and whose bottom right
corner is $(x, y)$. The algorithm (Figure 3) traverses the matrix left-to-right
and top-to-bottom creating a staircase for every entry in the matrix. We now
describe the construction of the staircase and the production of maximal empty
rectangles in detail.

3.1 Constructing Staircase(x, y)

The maximal staircase, staircase$(x, y)$, is specified by the coordinates of
the top-left corner $(x_i, y_i)$ of each of its steps. This sequence of steps
$((x_1, y_1), \ldots, (x_r, y_r))$ is stored in a stack, with the top step
$(x_r, y_r)$ on the top of the stack. The maximal staircase,
staircase$(x, y) = ((x_1, y_1), \ldots, (x_r, y_r))$, is easily constructed
from the staircase, staircase$(x-1, y) = ((x'_1, y'_1), \ldots,
(x'_{r'}, y'_{r'}))$ as follows. See Figure 4.

We start by computing y_r, which will be the Y-coordinate for the highest
entry in staircase(x,y). This step extends up from (x,y) until it is blocked by
the first 1 in column x. Searching for this 1-entry takes too much time. Instead,
y_r(x,y) will be computed in constant time from y_r(x,y-1), which we saved from
the (x,y-1) iteration. Here, y_r(x,y) is used to denote y_r to distinguish between
it and the same value for different iterations. By definition y_r(x,y) is the
Y-coordinate of the topmost 0-entry in the block of 0-entries in column x starting
at entry (x,y) and extending up. y_r(x,y-1) is the same except it considers
the block extending up from (x,y-1). Therefore, if entry (x,y) contains a 0,
then y_r(x,y) = y_r(x,y-1) (or y itself when entry (x,y-1) is a 1, since the
block then starts at (x,y)). On the other hand, if entry (x,y) contains a 1, then
y_r(x,y) is not well defined and staircase(x,y) is empty.
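
The constant-time update can be sketched in a few lines of Python. This is only
an illustration, not the authors' code: the function name update_column_tops,
the list top, the 0-indexed rows (y grows downward), and the convention of
storing y + 1 for a column that is blocked by a 1 are all assumptions made here.

```python
def update_column_tops(top, row, y):
    """Update top[x], the stand-in for y_r(x, y), after reading row y.

    Assumed convention: a 1-entry stores y + 1, i.e. the first row in which
    a new block of 0-entries could start. A 0-entry keeps the saved value,
    which is either the top of the block it continues or y itself.
    """
    for x, entry in enumerate(row):
        if entry == 1:
            top[x] = y + 1  # column x is blocked at row y
        # entry == 0: y_r(x, y) = y_r(x, y - 1), nothing to do
    return top
```

Starting from top = [0] * m and calling the function once per row keeps one
saved value per column, which matches the single saved row of y_r values
described above.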

How the rest of staircase(x, y) is constructed depends on how the new height
of the top step y_r compares with the old one y'_{r'}.




Fig. 4. The three cases in constructing maximal staircase, staircase(x,y), from
staircase(x-1,y).



Case y_r < y'_{r'}: Figure 4(a). If the new top step is higher than the old top step,
then the new staircase staircase(x, y) is the same as the old one staircase(x-1, y)
except one extra high step is added on the right. This step will have
width of only one column and its top-left corner will be (x, y_r). In this case,
staircase(x, y) is constructed from staircase(x-1, y) simply by pushing this
new step (x, y_r) onto the top of the stack.

Case y_r = y'_{r'}: Figure 4(b). If the new top step has the exact same height as
the old top step, then the new staircase staircase(x, y) is the same as the old
one staircase(x-1, y) except that this top step is extended one column to
the right. Because the data structure staircase(x, y) stores only the top-left
corners of each step, no change to the data structure is required.

Case y_r > y'_{r'}: Figure 4(c). If the new top step is lower than the old top step,
then all the old steps that are higher than this new highest step must be
deleted. The last deleted step is replaced with the new highest step. The
new highest step will have top edge at y_r and will extend to the left as
far as the last step (x'_i, y'_i) to be deleted. Hence, the top-left corner of this
new top step will be at location (x'_i, y_r). In this case, staircase(x, y) is
constructed from staircase(x-1, y) simply by popping off the stack the
steps (x'_{r'},y'_{r'}), (x'_{r'-1},y'_{r'-1}), ..., (x'_i,y'_i) as long as y_r > y'_i. Finally, the
new top step (x'_i, y_r) is pushed on top.

One key thing to note is that when constructing staircase(x, y) from
staircase(x-1, y), at most one new step is created.
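
The three cases can be folded into one small stack routine. The sketch below is
an illustration under stated assumptions rather than the authors' code: the
name extend_staircase, the use of a Python list as the stack, 0-indexed
coordinates with smaller y meaning higher, and passing None for y_r when entry
(x, y) is a 1 are all choices made here. It also skips the final push when the
step left on top already has height y_r, a situation the text does not spell out.

```python
def extend_staircase(stack, x, y_r):
    """Turn staircase(x - 1, y) into staircase(x, y), in place.

    stack holds the top-left corners (x_i, y_i) of the steps, with the
    highest step (smallest y_i) on top. y_r is None when entry (x, y) is 1.
    """
    if y_r is None:
        stack.clear()              # a 1-entry: the staircase is empty
        return stack
    left = x                       # Case (a): a new one-column step at x
    while stack and stack[-1][1] < y_r:
        left, _ = stack.pop()      # Case (c): delete steps that are too high
    if not stack or stack[-1][1] > y_r:
        stack.append((left, y_r))  # at most one new step is created
    # Otherwise the top step already has height y_r: Case (b), no change.
    return stack
```

Each call pushes at most one step, which is the property the complexity
analysis relies on.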



3.2 Outputting the Maximal 0-Rectangles 

The goal of the main loop is to output all maximal 0-rectangles with (x, y) as
the bottom-right corner. This is done by outputting all steps of staircase(x, y)
that cannot be extended down or to the right.

Whether such a step can be extended depends on where the first 1-entry is
located within row y+1 and where it is within column x+1. Consider the largest
block of 0-entries in row y+1 starting at entry (x,y+1) and extending to the
left. Let x_*(x,y) (or x_* for short) be the X-coordinate of this leftmost 0-entry
(see Figure 5). Similarly, consider the largest block of 0-entries in column x+1
starting at entry (x+1,y) and extending up. Let y_*(x,y) (or y_* for short) be
the Y-coordinate of this topmost 0-entry. By definition, y_*(x,y) = y_r(x+1,y)
and we know how to compute it. x_*(x,y) is computed in constant time from
x_*(x-1,y) in the same way. (See Figure 6.)

The following theorem states which of the rectangles within the staircase(x,y)
are maximal.

Theorem 1. Consider a step in staircase(x, y) with top-left corner (x_i, y_i). The
rectangle (x_i,x,y_i,y) is maximal if and only if x_i < x_* and y_i < y_*.

Proof. The step (x_i,y_i) of staircase(x,y) forms the rectangle (x_i,x,y_i,y). If
x_i ≥ x_*, then this rectangle is sufficiently skinny to be extended down into the
block of 0-entries in row y+1. For example, the highest step in Figure 5 satisfies
this condition. On the other hand, if x_i < x_*, then this rectangle cannot be
extended down because it is blocked by the 1-entry located at (x_* - 1, y+1).
Similarly, the rectangle is sufficiently short to be extended to the right into the
block of 0-entries in column x+1 only if y_i ≥ y_*. See the lowest step in Figure 5.
Hence, the rectangle is maximal if and only if x_i < x_* and y_i < y_*.

To output the steps that are maximal 0-rectangles, pop the steps (x_r,y_r),
(x_{r-1},y_{r-1}), ... from the stack. The x_i values will get progressively smaller and
the y_i values will get progressively larger. Hence, the steps can be divided into
three intervals. At first, the steps may have the property x_i ≥ x_*. As said,
these steps are too skinny to be maximal. Eventually, the x_i of the steps will
decrease until x_i < x_*. Then there may be an interval of steps for which x_i < x_*
and y_i < y_*. These steps are maximal. For these steps output the rectangle
(x_i,x,y_i,y). However, y_i may continue to increase until y_i ≥ y_*. The remaining
steps will be too short to be maximal.

Recall that the next step after outputting the maximal steps in staircase(x, y)
is to construct staircase(x+1, y) from staircase(x, y). Conveniently, the work
required for these two operations is precisely the same. This is because
y_r(x+1, y) = y_*(x,y). The steps from the first and second interval, i.e.
y_i < y_*(x,y) = y_r(x+1, y), can be thrown away as they are popped off the stack,
because they are precisely the steps that are thrown away when constructing
staircase(x+1,y) from staircase(x,y). Similarly, the staircase steps in the third
interval, i.e. y_i ≥ y_*(x,y) = y_r(x+1,y), do not need to be popped, because they
are not maximal in staircase(x,y) and are required for staircase(x+1,y).
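
Putting Sections 3.1 and 3.2 together gives the following sketch of the whole
procedure. It is a hedged illustration, not the authors' implementation: it
assumes the matrix fits in memory as a list of rows, uses 0-indexed inclusive
coordinates with y growing downward, and the names maximal_empty_rectangles,
top, left, and run_start are invented here. Sentinel values (y + 1 and x + 1)
stand in for the "undefined" and "infinite" cases of y_r, y_* and x_*.

```python
def maximal_empty_rectangles(matrix):
    """Return every maximal 0-rectangle as (x_left, x_right, y_top, y_bottom)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    top = [0] * n_cols            # top[x] plays the role of y_r(x, y)
    found = []
    for y in range(n_rows):
        # y_r in O(1) per entry: a 1 blocks the column, a 0 keeps the old value.
        for x in range(n_cols):
            if matrix[y][x] == 1:
                top[x] = y + 1
        stack = []                # steps (x_i, y_i), highest step on top
        left = 0                  # left edge of the next step to be pushed
        run_start = 0             # start of the current run of 0s in row y + 1
        for x in range(n_cols):
            # x_*: leftmost 0 of the run in row y + 1 that ends at column x.
            if y + 1 < n_rows and matrix[y + 1][x] == 0:
                if x == 0 or matrix[y + 1][x - 1] == 1:
                    run_start = x
                x_star = run_start
            else:
                x_star = x + 1    # "infinity": no step can be extended down
            if matrix[y][x] == 1:
                stack.clear()     # staircase(x, y) is empty
                left = x + 1
                continue
            # (I) Finish constructing staircase(x, y).
            if not stack or stack[-1][1] > top[x]:
                stack.append((left, top[x]))
            # y_* = y_r(x + 1, y); past the last column nothing extends right.
            y_star = top[x + 1] if x + 1 < n_cols else y + 1
            # (II) Pop the steps with y_i < y_*; output those with x_i < x_*.
            # The same pops prepare the stack for staircase(x + 1, y).
            left = x + 1
            while stack and stack[-1][1] < y_star:
                x_i, y_i = stack.pop()
                if x_i < x_star:
                    found.append((x_i, x, y_i, y))
                left = x_i
    return found
```

For example, maximal_empty_rectangles([[0, 0], [0, 1]]) reports the top row
and the left column. Reading row y + 1 directly is a simplification; a
streaming version would buffer one row ahead instead.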




Fig. 5. Computing staircase(x, y).

Fig. 6. Computing x_*(x, y) from x_*(x-1, y).



3.3 Time and Space Complexity 

Theorem 2. The time complexity of the algorithm is O(nm).

Proof. Most of the algorithm is clearly constant time per (x, y) iteration of the
main loop. We have already described how to compute y_r(x,y), x_*(x,y), and
y_*(x,y) in constant time. The remaining task that might take more time is
popping the steps off the stack to check if they are maximal, deleting them if
they are too skinny, and outputting and deleting them if they are maximal. For
a particular (x, y) iteration, an arbitrary number of steps may be popped. This
takes more than a constant amount of time for this iteration. However, when
amortized over all iterations, at most one step is popped per iteration.

Consider the life of a particular step. During some iteration it is created and
pushed onto the stack. Later it is popped off. Each (x,y) iteration creates at
most one new step and then only if the (x, y) entry is 0. Hence, the total number
of steps created is at most the number of 0-entries in the matrix. As well, because
each of these steps is popped at most once in its life and output as a maximal
0-rectangle at most once, we can conclude that the total number of times a step
is popped and the total number of maximal 0-rectangles are both at most the
number of 0-entries in the matrix.

It follows that the entire computation requires only O(nm) time (where n =
|X| and m = |Y|).
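
The amortized argument can be written as a short chain of inequalities. The
symbol N_0 for the number of 0-entries is the one used later in Section 3.4;
pops(x,y) is a name introduced here for the number of steps popped in
iteration (x,y).

```latex
\begin{align*}
\sum_{(x,y)} \mathrm{pops}(x,y)
  &\le \#\{\text{steps ever pushed}\} \le N_0 \le nm, \\
\text{total time}
  &= \sum_{(x,y)} \bigl( O(1) + \mathrm{pops}(x,y) \bigr)
   = O(nm) + O(N_0) = O(nm).
\end{align*}
```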



Theorem 3. The algorithm requires O(min(n, m)) space.

Proof. If the matrix is too large to fit into main memory, the algorithm is such
that one pass through the matrix is sufficient. Other than the current (x,y)-
entry of the matrix, only O(min(n, m)) additional memory is required. The stack
for staircase(x, y) contains neither more steps than the number of rows nor
more than the number of columns. Hence, |staircase(x,y)| = O(min(n, m)).
The previous value for x_* requires O(1) space. The previous row of y_* values
requires O(n) space, but the matrix can be transposed so that there are fewer
rows than columns.

3.4 Number and Distribution of Maximal 0-Rectangles 
Theorem 4. The number of maximal 0-rectangles is at most O(nm).

Proof. Follows directly from the proof of Theorem 2.

We now demonstrate two very different matrices that have O(nm) maximal
0-rectangles. See Figure 7. The first matrix simply has O(nm) 0-entries each of
which is surrounded on all sides by 1-entries. Each such 0-entry is in itself a
maximal 0-rectangle.

For the second construction, consider the n by n matrix with two diagonals of
1-entries. One runs from the middle of the left side to the middle of the bottom. The
other runs from the middle of the top to the middle of the right side. The remaining
entries are 0. Choose any of the n/2 0-entries along the bottom 1-diagonal and any
of the n/2 0-entries along the top 1-diagonal. The rectangle with these two 0-entries
as corners is a maximal 0-rectangle. There are O(n^2) of these. Attaching m/n of
these matrices in a row will give you an n by m matrix with (m/n) O(n^2) = O(nm)
maximal 0-rectangles.



Fig. 7. Two matrices with O(nm) maximal 0-rectangles



Actual data generally has structure to it and hence contains large 0-rectangles.
We found this to be true in all our experiments [6]. However, a randomly
chosen matrix does not contain large 0-rectangles.

Theorem 5. Let M be an n x n matrix where each entry is chosen to be 1
independently at random with probability α_1 = N_1/n^2. The expected number of
1-entries is N_1. The probability of it having a 0-rectangle of size s is at most
p = (1 - α_1)^s n^3 and the expected number of disjoint 0-rectangles of size s is at
least E = (1 - α_1)^s n^2 / s.

Proof. A fixed rectangle of size s obtains all 0's with probability (1 - α_1)^s.
There are at most n^3 different rectangles of size s in an n x n matrix. Hence, the
probability that at least one of them is all 0's is at most p = (1 - α_1)^s n^3. The
number of disjoint square rectangles within an n x n matrix is n^2/s. The expected
number of these that are all 0's is E = (1 - α_1)^s n^2/s.
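
The first bound of Theorem 5 is a union bound over the at most n^3 rectangles
of size s. Written out, and using the standard inequality 1 - α ≤ e^{-α} (a step
added here, not spelled out in the text), it also gives the rectangle size for
which the probability drops below a target p:

```latex
\begin{align*}
\Pr[\text{some 0-rectangle of size } s]
  &\le \sum_{R \,:\, |R| = s} \Pr[R \text{ is all 0's}]
   \le n^3 (1-\alpha_1)^s
   \le n^3 e^{-\alpha_1 s}, \\
n^3 e^{-\alpha_1 s} \le p
  &\iff s \ge \frac{1}{\alpha_1}\Bigl[\, 3\ln n + \ln\tfrac{1}{p} \Bigr].
\end{align*}
```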

Example 1. If the density of 1's is only α_1 = 1/1,000, then the probability of having
a 0-rectangle of size s = (1/α_1)[3 ln(n) + ln(1/p)] = 3,000 ln(n) + 7,000 is at most
p = 1/1,000 and the expected number of 0-rectangles of size
s = 2,000 ln(n) - 1,000 ln ln(n) - 14,500 is at least 1000.

As a second example, the probability of having a 0-rectangle of size s = q · n^2
(with q = 1/1,000) is at most p = 1/1,000 when the number of 1's is at least
N_1 = α_1 n^2 = (1/q)[3 ln(n) + ln(1/p)] = 3,000 ln(n) + 7,000. The expected number
of this size is at least E = 100 when the number of 1's is at most
N_1 = (1/q) ln(1/(qE)) = 2,300.

The expected number of rectangles can be derived as a consequence of
Theorem 5 as E(s) ≤ O(min(N_1 log N_1, N_0)) (where N_1 is the number of 1-entries
and N_0 the number of 0-entries). This value increases almost linearly with N_1 as
N_1 log N_1 until N_1 log N_1 = N_0 = n^2/2 and then decreases linearly with n^2 - N_1.

3.5 Multi-dimensional Matrices 

The algorithm that finds all maximal 0-rectangles in a given two dimensional
matrix can be extended to find all maximal d-dimensional 0-rectangles within a
given d-dimensional matrix. In the 2-dimensional case, we looped through the
entries (x, y) of the matrix, maintaining the maximal staircase, staircase(x, y)
(see Figure 2). This consists of a set of steps. Each such step is a 0-rectangle
(x_i,x,y_i,y) that cannot be extended by decreasing the x_i or the y_i coordinates.
There are at most O(n) such “stairs”, because their lower points (x_i,y_i)
lie along a 1-dimensional diagonal. In the 3-dimensional case, such a maximal
staircase, staircase(x,y,z), looks like one quadrant of a pyramid. Assuming (for
notational simplicity) that every dimension has size n, then there are at most
O(n^2) stairs, because their lower points (x_i,y_i,z_i) lie along a 2-dimensional
diagonal. In general, staircase(x_1, x_2, ..., x_d) consists of the set of at most
O(n^{d-1}) rectangles (steps) that have (x_1, x_2, ..., x_d) as the upper corner and
that cannot be extended by decreasing any coordinate.

In the 2-dimensional case, we construct staircase(x, y) from staircase(x-1, y)
by imposing what amounts to a 1-dimensional staircase on to its side (see
Figure 4). This 1-dimensional staircase consists of a single step rooted at (x, y) and
extending in the y dimension to y_r. It was constructed from the 1-dimensional
staircase rooted at (x,y-1) by extending it with the 0-dimensional staircase
consisting only of the single entry (x,y). The 1-dimensional staircase rooted at
(x,y-1) had been constructed earlier in the algorithm and had been saved in
memory. The algorithm saves a line of n such 1-dimensional staircases.







In the 3-dimensional case, the algorithm saves the one 3-dimensional staircase
staircase(x-1,y,z), a line of n 2-dimensional staircases, and a plane
of n^2 1-dimensional staircases, and has access to a cube of 0-dimensional
staircases consisting of the entries of the matrix. Each iteration, it constructs
the 3-dimensional staircase(x, y, z) from the previously saved 3-dimensional
staircase(x-1, y, z) by imposing a 2-dimensional staircase on to its side. This
2-dimensional staircase is rooted at (x, y, z) and extends in the y, z plane. It is
constructed from the previously saved 2-dimensional staircase rooted at (x, y-1, z)
by imposing a 1-dimensional staircase on to its side. This 1-dimensional staircase
is rooted at (x,y,z) and extends in the z dimension. It is constructed from the
previously saved 1-dimensional staircase rooted at (x, y, z-1) by imposing a
0-dimensional staircase. This 0-dimensional staircase consists of the single entry
(x,y,z). This pattern is extended for the d-dimensional case.

The running time, O(N_0 d n^{d-2}), is dominated by the time to impose the
(d-1)-dimensional staircase onto the side of the d-dimensional one. With the right
data structure, this can be done in time proportional to the size of the
(d-1)-dimensional staircase, which as stated is O(d n^{d-2}). Doing this for every
0-entry (x, y, z) requires a total of O(N_0 d n^{d-2}) time.

When constructing staircase(x,y,z) from staircase(x-1,y,z) some new
stairs are added and some are deleted. The deleted ones are potential maximal 
rectangles. Because they are steps, we know that they cannot be extended by 
decreasing any of the dimensions. The reason that they are being deleted is 
because they cannot be extended by increasing the x dimension. What remains 
to be determined is whether or not they can be extended by increasing either 
the y or the z dimension. In the 2-dimensional case, there is only one additional 
dimension to check and this is done easily by reading one row ahead of the current 
entry (x,y). In the 3-dimensional case this is harder. One possible solution is to 
read one y, z plane ahead. An easier solution is as follows. 

The algorithm has three phases. The first phase proceeds as described above 
storing all large 0-rectangles that cannot be extended by decreasing any of the 
dimensions (or by increasing the x dimension). The second phase turns the 
matrix upside down and does the same. This produces all large 0-rectangles that 
cannot be extended by increasing any of the dimensions (or by decreasing the x 
dimension). The third phase finds the intersection of these two sets by sorting 
them and merging them together. These rectangles are maximal because they 
cannot be extended by decreasing or by increasing any of the dimensions. This 
algorithm makes only two passes through the matrix and uses only O(d n^{d-1})
space.

Theorem 6. The time complexity of the algorithm is O(N_0 d n^{d-2}) ≤ O(d n^{2d-2})
and space complexity O(d n^{d-1}).

Theorem 7. The number of maximal 0-hyper-rectangles in a d-dimensional
matrix is Θ(n^{2d-2}).

Proof. The upper bound on the number of maximal rectangles is given by the
running time of the algorithm that produces them all. The lower bound is proved
by constructing a matrix that has Ω(n^{2d-2}) such rectangles. The construction is
very similar to that for the d = 2 case given in Figure 7, except that the lower
and the upper diagonals each consist of a (d-1)-dimensional plane of points.
Taking most combinations of one point from the bottom plane and one from the
top produces Θ(n^{d-1}) x Θ(n^{d-1}) maximal rectangles.

The number of such maximal hyper-rectangles and hence the time to produce
them increases exponentially with d. For d = 2 dimensions, this is Θ(n^2), which
is linear in the size Θ(n^2) of the input matrix. For d = 3 dimensions, it is already
Θ(n^4), which is not likely practical in general for large data sets.

4 Performance of Mining Algorithm 

In this section, we present experiments designed to verify the claims of the algo- 
rithm’s scalability and usefulness on large data sets. These tests were run against 
synthetic data. The experiments reported here were run on an (admittedly slow) 
multi-user 42MHz IBM RISC System/6000 machine with 256 MB RAM. 

The performance of the algorithm depends on the number of tuples T = |D|
(which in matrix representation is the number of 1-entries), the number n of 
distinct values of X, the number m of distinct values of Y, and the number of 
maximal empty rectangles R. We report the runtime and the number of maximal 
empty rectangles R (where applicable). 

Effect of T, the number of tuples, on Runtime. To test the scalability of the
algorithm with respect to the data set size, we held the data density (that is, the
ratio of T to n*m) constant at one fifth. We also held n constant at 1000 since
our algorithm maintains a data structure of O(n) size. We found these numbers
to be in the middle of the ranges of the values for our real data sets. The data
set is a randomly generated set of points. Initially, m is set to 500 and T to
100,000 tuples. Figure 8 plots the runtime of the algorithm with respect to the
number of tuples T. The performance scales linearly as expected. The number
of maximal empty rectangles also scales almost linearly with T, as our analytic
results of Section 3.4 predict.



Fig. 8. Random data under constant n as T increases.

Fig. 9. Random data as the matrix size is increased (constant T). [Plot: Runtime
(min) versus Size of Matrix.]

Effect of Data Density on Runtime. Note that the algorithm requires only a
single pass over the data, which is why we expected this linear scale-up for the
previous experiment. However, the in-memory processing time is O(nm), which
may, for sparse data, be significantly more than the size of the data. Hence, we
verified experimentally that the algorithm’s performance is dominated by the
single pass over the data, not by the O(n) in-memory computations required
for each row. In this experiment, we kept both T and n constant and increased
m. As we do, we increase the sparsity of the matrix. We expect the runtime
to increase, but the degree of this increase quantifies the extent to
which the processing time dominates the I/O. Figure 9 plots the runtime of the
algorithm with respect to the size of the matrix.

Effect of R, the number of maximal empty rectangles, on Runtime. Since the 
data was generated randomly in the first experiment, we could not precisely 
control the number of empty rectangles. In this next experiment, this number 
was tightly controlled. We generated a sequence of datasets shown for clarity in 
matrix representation in Figure 10. 

Let m = 1000, n = 2000, T = 1,000,000. We start with a matrix that has
1000 columns filled with 1-entries separated by 1000 columns filled with 0-entries
(for a total of 2000 columns). For each new test, we cluster the columns so that
the number of spaces separating them decreases. We do this until there is one
big square 1000x1000 filled with 1-entries, and another large square 1000x1000
filled with 0s. Thus, the number of empty rectangles decreases from 1000 to 1.
We would expect that R should not affect the performance of the algorithm and
this is verified by the results (Figure 10).

Fig. 10. Data sets and performance as the number of maximal empty rectangles R is
increased.



Effect of n on the Runtime. We also tested the performance of the algorithm 
with respect to the number of distinct X values, n. Here, we kept T constant 
at 1,000,000 and R constant at 1,000 and varied both n and m. To achieve this, 
we constructed a sequence of data sets shown again in matrix representation in 
Figure 11(a). 

For the first matrix, i is set to 1 so the data contains 1000 columns of 1-entries,
each a single entry wide. Each column was separated by a single column of all 0’s
(all columns are initially 1000 entries high). In the second matrix, the height of all
columns is reduced by half (to 500 entries). The width of each column (both the
columns containing 1’s and the columns containing 0’s) is doubled. This keeps
the number of tuples constant, while increasing to 2000 the number of distinct
X values. The number of columns with 0’s does not change (only their width
increases), hence the number of discovered empty rectangles remains constant.
The process of reducing the height of the columns and multiplying their number
is continued until all columns are of height 4.

Fig. 11. Synthetic data under constant R and T. Runtime is plotted versus n.
[Panel (a): data sets with 1000 columns of 0’s, the width of each column is i.
Panel (b): Time (h) versus number of distinct values of X (log scale).]

The performance of the algorithm, as shown in Figure 11(b), deteriorates 
as expected with increasing n. Specifically, when the size of the data structures 
used grows beyond the size of memory, the data structures must be swapped 
in and out of memory. To avoid this, for data sets over two very large domains 
(recall that n is the minimum number of distinct values in the two attributes), 
the data values will need to be grouped using, for example, a standard binning 
or histogram technique to reduce the number of distinct values of the smallest 
attribute domain. 

5 Conclusions 

We developed an efficient and scalable algorithm that discovers all maximal 
empty rectangles with a single scan over a sorted two-dimensional data set. Pre- 
viously proposed algorithms were not practical for large datasets since they did 
not scale (they required space proportional to the dataset size). We presented 
an extension of our algorithm to multi-dimensional data and we presented new 
results on the time and space complexity of these problems. Our mining algo- 
rithm can be used both to characterize the empty space within the data and to 
characterize any homogeneous regions in the data, including the data itself. By 
interchanging the roles of 0’s and 1’s in the algorithm, we can find the set of
all maximal rectangles that are completely full (that is, they contain no empty 
space) and that cannot be extended without incorporating some empty space. 
Knowledge of empty rectangles may be valuable in and of itself as it may reveal 
unknown correlations or dependencies between data values and we have begun 
to study how it may be fully exploited in query optimization [6].

Abstract. Discovering functional dependencies from existing databases
is an important technique strongly required in database design and
administration tools. Investigated for many years, this issue has recently
been addressed from a data mining viewpoint, in a novel and more
efficient way, by following the principles of level-wise algorithms. In
this paper, we propose a new characterization of minimal functional
dependencies which provides a formal framework simpler than previous
proposals. The algorithm defined for enforcing our approach has been
implemented and evaluated experimentally. It is more efficient (whatever
the configuration of the original data) than the best operational solution
(to our knowledge): the algorithm Tane. Moreover, our approach also
performs (without additional execution time) the mining of embedded
functional dependencies, i.e. dependencies holding for a subset of the
attribute set initially considered (e.g. for materialized views widely used in
particular for managing data warehouses).



1 Motivations 

Functional Dependencies (FDs) capture usual semantic constraints within data.
An FD between two sets of attributes (X, Y) holds in a relation if values of
the latter set are fully determined by values of the former [11]. It is denoted by
X → Y. Functional dependencies are a key concept in relational theory and the
foundation for data organization when designing relational databases [25,30] but
also object oriented databases [39].

Discovering FDs from existing databases is an important issue, investigated for
many years [20,30], and recently addressed from a data mining point of view [18,
19,26], in a novel and more efficient way.

The motivations for addressing this issue originate from various application
fields: database administration and design, reverse-engineering and query
folding [29]. Extracting such knowledge makes it possible to assess the
minimality of keys, control normalization and detect denormalized relations. The
latter situation could be desired for optimization reasons but it could also result
from design errors, or schema evolutions not controlled over time. In such cases,
the database administrator is provided with relevant knowledge for making
reorganization decisions. In a reverse-engineering objective, elements of the
conceptual schema of data can be exhibited from the knowledge of FDs [10,32,34].
Query folding can also benefit from the discovered knowledge by selecting
the best data resource for answering a request or using specific rules satisfied by
the processed instances [17,35].

Motivated by the described application fields, various approaches addressed the
presented issue, known as the problem of FD inference [16,21]. Among them, we
underline the various contributions of H. Mannila and K.J. Raiha [20,21,29,30],
and a recently proposed operational approach, Tane [18,19], which provides a
particularly efficient algorithm when compared with previous solutions for
discovering the set of minimal FDs. Because this set is unique, we consider it
as the canonical cover of FDs.

In this paper, we propose an approach for mining minimal functional
dependencies. We define a new characterization of such dependencies that is simpler
than previous proposals and based on particularly simple concepts. The associated
implementation, the algorithm Fun, is very efficient: it is comparable
to Tane or improves on its execution times, whatever the configuration of the
original data. Furthermore, our approach performs the mining of embedded
dependencies. Known as the projection of FDs [13,15,25,30], the underlying issue
could be summarized as follows: given a set of FDs holding in a relation, which are
those satisfied over any attribute subset of the relation? “This turns out to be a
computationally difficult problem” [30].

The paper is organized as follows. In Sect. 2, we present the formal framework of 
our approach which is enforced through the algorithm Fun detailed in Sect. 3. 
Section 4 is devoted to the exhibition of embedded dependencies. Experimental 
and comparative results are given in Sect. 5 while Sect. 6 provides an overview of 
related work. As a conclusion, we discuss the advantages of our approach when 
compared to previous work and evoke further research work. 



2 A Novel Characterization of Minimal FDs 

We present, in this section, the theoretical basis of our approach. Due to lack
of space, we do not recall the basic concepts of the relational theory, widely
described in [25,30]. The definition of a new and simple concept, along with variants
of existing ones, makes it possible to provide a new characterization of minimal
FDs. The introduced concept of Free Set is defined as follows.

Definition 1. Free Set

Let X ⊆ R be a set of attributes. X is a free set in r, an instance of relation over
R, if and only if: ∄ X' ⊂ X such that |X'|_r = |X|_r, where |X|_r stands for the
cardinality of the projection of r over X.

The set of all free sets in r is denoted by FS_r. Any combination of attributes
not belonging to FS_r is called a non free set.

Lemma 1. ∀ X ⊆ R, ∀ X' ⊆ X, |X'|_r = |X|_r ⇔ X' → X.

Proof. Obvious according to the definition of FD.

Example 1. In order to illustrate this paper, our relation example (cf. Fig. 1) is
supposed to be used for managing courses in a university and scheduling lectures.
A lecture is characterized by the teacher giving it (PROF), the associated
course (CSE), the day and hours (DAY, BEGIN, END), the room where it
takes place, and its capacity (ROOM, CAP).

Fig. 1. The relation example LECT

Let us consider, in our relation example, the couple of attributes (DAY,
BEGIN). It is particularly easy to verify that this couple is a free set since
its cardinality in the relation LECT is different from the cardinalities of all its
component attributes: |DAY, BEGIN|_LECT = 4; |DAY|_LECT = 3; |BEGIN|_LECT
= 2. In contrast, (ROOM, CAP) is not a free set since |ROOM, CAP|_LECT =
|ROOM|_LECT = 4. Thus ROOM → CAP holds in the relation LECT (cf. Lemma
1). □
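
The free-set test of Definition 1 only needs projection cardinalities, as the
following Python sketch shows. The relation below is hypothetical: Fig. 1 is not
reproduced here, so the four rows were invented so that they yield the same
cardinalities as Example 1. The names card and is_free are likewise assumptions
of this sketch. Checking only the maximal proper subsets X - {A} is sufficient,
because any smaller subset has a cardinality that is no larger.

```python
# Hypothetical sample rows (not the LECT relation of Fig. 1).
SAMPLE = [
    {"ROOM": "R1", "CAP": 50,  "DAY": "Mon", "BEGIN": 9},
    {"ROOM": "R2", "CAP": 100, "DAY": "Mon", "BEGIN": 14},
    {"ROOM": "R3", "CAP": 50,  "DAY": "Tue", "BEGIN": 9},
    {"ROOM": "R4", "CAP": 30,  "DAY": "Wed", "BEGIN": 9},
]


def card(rows, attrs):
    """|X|_r: number of distinct tuples in the projection of rows onto attrs."""
    ordered = sorted(attrs)
    return len({tuple(row[a] for a in ordered) for row in rows})


def is_free(rows, attrs):
    """True if no proper subset of attrs has the same cardinality as attrs."""
    attrs = frozenset(attrs)
    whole = card(rows, attrs)
    # Only the maximal proper subsets need to be compared.
    return all(card(rows, attrs - {a}) != whole for a in attrs)
```

With these rows, is_free(SAMPLE, {"DAY", "BEGIN"}) holds while
is_free(SAMPLE, {"ROOM", "CAP"}) does not, mirroring the two cases of Example 1.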

Lemma 2.

- Any subset of a free set is a free set itself: ∀ X ∈ FS_r, ∀ X' ⊂ X, X' ∈ FS_r.

- Any superset of a non free set is non free: ∀ X ∉ FS_r, ∀ Y ⊃ X, Y ∉ FS_r.

Proof. For the first point of the lemma, let us consider a free set X ∈ FS_r and
assume, for the sake of a contradiction, that: ∃ X' ⊂ X such that X' ∉ FS_r.
Thus, according to Definition 1, ∃ X'' ⊂ X' such that |X''|_r = |X'|_r. And we
have X'' → X' (cf. Lemma 1). Applying the Armstrong axioms [3], we infer
the following FD: X'' ∪ (X - X') → X' ∪ (X - X'), which could be rewritten:
X - (X' - X'') → X. Thus |X - (X' - X'')|_r = |X|_r and X ∉ FS_r, which
contradicts the initial assumption. Let us assume now that X is not a free set,
thus ∃ X' ⊂ X such that |X'|_r = |X|_r. Since we have the FD X' → X, any
dependency X' ∪ Z → X ∪ Z is satisfied, ∀ Z ⊆ R - X. Thus |X' ∪ Z|_r = |X ∪ Z|_r,
and X ∪ Z ∉ FS_r.

With the concept of free sets, we are provided with a simple characterization of
sources (or left-hand sides) of minimal FDs because the source of any minimal
FD is necessarily a free set (as proved by Theorem 1). In fact, a free set cannot
capture any FD (cf. Definition 1 and Lemma 1) while at least one FD holds
between attributes of a non free set. The following definitions are introduced for
characterizing targets (or right-hand sides) of such dependencies.

Definition 2. Attribute set closure in a relation

Let X be a set of attributes, X ⊆ R. Its closure in r is defined as follows:

X_r^+ = X ∪ {A ∈ R - X / |X|_r = |X ∪ A|_r}.

Definition 3. Attribute set quasi-closure in a relation

The quasi-closure of an attribute set X in r, denoted by X_r^⋄, is:

X_r^⋄ = X ∪ ⋃_{A ∈ X} (X - A)_r^+.

The closure of an attribute set X in the relation r encompasses any attribute A
of the relation schema R, determined by X. Definition 2 guarantees that X → A
holds by comparing cardinalities of X and X ∪ A (cf. Lemma 1). The quasi-closure
of X groups the closure of all its maximal subsets and X itself. Thus
any attribute A ∈ X_r^⋄ - X is determined by a maximal subset X' of X. Since
r ⊨ X' → A (the FD holds in r), we have r ⊨ X → A, but the latter FD is not
minimal because its source encompasses an extraneous attribute [5]. According
to the monotonicity and extensibility properties of the closure [7,16], we have:
X ⊆ X_r^⋄ ⊆ X_r^+. Thus Definition 2 can be rewritten (for operational reasons):
X_r^+ = X_r^⋄ ∪ {A ∈ R - X_r^⋄ / |X|_r = |X ∪ A|_r}.

Through the following theorem, we state that the set of FDs characterized by
the introduced concepts is the canonical cover of FDs for the relation $r$,
denoted by $MinDep(r)$.

Theorem 1. $MinDep(r) = \{X \to A \;/\; X \in \mathcal{FS}_r \text{ and } A \in X_r^{+} - X_r^{\diamond}\}$

Proof. ($\supseteq$) Since $A \in X_r^{+}$, $A$ is determined by $X$, thus
$X \to A$ holds. It is minimal because $A \notin X_r^{\diamond}$. Thus no
maximal subset of $X$ could determine $A$ (and a fortiori no subset of $X$
whatsoever).

($\subseteq$) Since $X \to A \in MinDep(r)$, we have: $\forall X' \subset X$,
$|X'|_r \neq |X|_r$ (cf. Lemma 1). Thus $X \in \mathcal{FS}_r$ (cf. Definition 1).
It is obvious that $A \in X_r^{+}$ since the FD holds. Moreover, no subset of $X$
(a fortiori none of its maximal subsets) could determine $A$ since the FD is
minimal: $A \notin X_r^{\diamond}$. Thus $A \in X_r^{+} - X_r^{\diamond}$.

Example 2. Let us consider, in our relation example (cf. Fig. 1), the following
combination $X$ = {ROOM, DAY, BEGIN}, which is a free set. Computing the
cardinality of $X$ and the cardinality of its maximal subsets results in the
following quasi-closure and closure:

$X_r^{\diamond}$ = {ROOM, DAY, BEGIN, END, CAP}

$X_r^{+}$ = {ROOM, DAY, BEGIN, END, CAP, CSE, PROF}

Thus both ROOM, DAY, BEGIN $\to$ CSE and ROOM, DAY, BEGIN $\to$ PROF
are minimal FDs, according to Theorem 1. □
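
To make Definitions 2 and 3 and Theorem 1 concrete, the following brute-force
Python sketch computes cardinalities, closures, quasi-closures and the resulting
minimal FDs on a tiny relation. It only illustrates the definitions and is not
the Fun algorithm: the relation `ROWS`, the attribute names `A`, `B`, `C` and all
function names are hypothetical, and the sketch assumes that no attribute is
constant over the relation.

```python
from itertools import combinations

# Hypothetical toy relation (not the LECT relation of Fig. 1).
ROWS = [
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "y", "C": "p"},
    {"A": 2, "B": "x", "C": "q"},
    {"A": 2, "B": "y", "C": "q"},
    {"A": 3, "B": "x", "C": "q"},
]
SCHEMA = frozenset("ABC")


def card(rows, attrs):
    """|X|_r: number of distinct tuples in the projection of rows over attrs."""
    ordered = sorted(attrs)
    return len({tuple(row[a] for a in ordered) for row in rows})


def is_free(rows, attrs):
    """X is free if no proper subset has the same cardinality.

    Cardinality is monotone, so checking the maximal subsets is sufficient.
    """
    attrs = frozenset(attrs)
    return all(card(rows, attrs - {a}) < card(rows, attrs) for a in attrs)


def closure(rows, schema, attrs):
    """Definition 2: X plus every attribute A such that |X|_r = |X U A|_r."""
    attrs = frozenset(attrs)
    size = card(rows, attrs)
    return attrs | {a for a in schema - attrs if card(rows, attrs | {a}) == size}


def quasi_closure(rows, schema, attrs):
    """Definition 3: X plus the closures of all its maximal subsets."""
    attrs = frozenset(attrs)
    result = set(attrs)
    for a in attrs:
        result |= closure(rows, schema, attrs - {a})
    return frozenset(result)


def min_dep(rows, schema):
    """Theorem 1: pairs (X, A) with X free and A in closure minus quasi-closure."""
    fds = set()
    for k in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), k):
            source = frozenset(combo)
            if is_free(rows, source):
                targets = closure(rows, schema, source) - quasi_closure(rows, schema, source)
                fds |= {(source, a) for a in targets}
    return fds
```

On this toy relation, `min_dep(ROWS, SCHEMA)` contains the single dependency with
source `{A}` and target `C`; the key `{A, B}` is a free set but contributes no
minimal FD because `C` already belongs to its quasi-closure.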

For mining the canonical cover of FDs in a relation, the closure and quasi-closure
of attribute sets have to be computed (cf. Theorem 1). Thus counting operations
must be performed to obtain the cardinality of the considered sets and of all
their maximal subsets (cf. Definitions 2 and 3). The following lemma is
introduced to make the computation of cardinalities efficient. Lemma 3 states
that for any non-free set $X$ whose subsets are free sets, the cardinality
of $X$ is equal to the maximal cardinality of its subsets.
Lemma 3. $\forall X \notin \mathcal{FS}_r$ such that all its subsets $X'$ are
free sets in $r$:

$|X|_r = Max(\,|X'|_r \;/\; X' \subset X \text{ and } X' \in \mathcal{FS}_r\,)$.

Proof. $\forall X \subseteq R$ and $\forall X' \subset X$, it is obvious that
$|X|_r \geq |X'|_r$ (according to the FD definition). Since we assume that
$X \notin \mathcal{FS}_r$ and that its subsets are free sets, we know that there
exists at least one maximal subset $X'$ determining an attribute $A$ (according
to Definition 1 and Lemmas 1 and 2): $\exists X' \subset X$, $\exists A = X - X'$
such that $X' \to A$. Thus $|X|_r = |X' \cup A|_r = |X'|_r$.

With Theorem 1, we obtain a simple characterization of minimal FDs, using
particularly simple concepts. Lemma 2 gives rules that are especially useful
when implementing the approach with a level-wise algorithm. Lemma 3
complements this result by defining a novel way to compute FD targets using
counting inference. This computation, proved to be correct, is moreover
operational. We show in the following sections (Sects. 3 and 5) that an
implementation based on our formal approach results in a very efficient solution.



3 The Algorithm FUN 

Like various other proposals addressing data mining issues (from Apriori [2] to
Tane [18,19], cf. Sect. 6), our algorithm, called Fun, is a level-wise algorithm
exploring, level after level, the attribute set lattice of the input relation. We
describe in more depth the general principles of Fun, detail the algorithms, and
give a running example.



3.1 General Principles 

Fun handles, step by step, attribute sets of increasing length and, at each new
level, benefits from the knowledge acquired at the previous iteration.

Let us assume that each level $k$ is provided with a set of possible free sets of
length $k$, called candidates. To check that a candidate is actually a free set,
its cardinality must be compared to the cardinality of all its maximal subsets.
If it is proved to be a free set, it is a possible source of FDs and it must be
preserved. If not, the candidate encompasses at least one maximal subset with
the same cardinality and it captures at least one FD. When the latter situation
occurs, we know (according to Lemma 2) that all the supersets of the candidate
are non-free and that the candidate cannot be the source of a minimal FD
(according to Theorem 1). Thus it is discarded from further exploration. The
$k$-th level completes by yielding the set of all free sets encompassing $k$
attributes, and all minimal FDs captured by the initial candidates of length $k$
are exhibited. The set of all free sets is used for providing the next level
with a new set of candidates. Such an operation is performed in a way similar to
the procedure of candidate generation defined in Apriori [2] and adapted in
Tane [18]. Free sets of length $k$ are expanded to give new candidates of length
$(k + 1)$, by combining two free sets whose subsets are all free sets. The
algorithm completes when, at some level, no further candidates are generated.
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3.2 Detailed and Commented Algorithms 

The described general principles are enforced through the algorithm Fun. At
each level $k$, the managed set of candidates is called $L_k$. Any element of this
set is described as a quadruple <candidate, count, quasiclosure, closure> in which
the elements stand for the list of attributes of the candidate, its cardinality,
its quasi-closure, and its closure.

The first candidate set $L_0$ is initialized to a single quadruple in which count
is set to 0 and the other components are empty sets (line 1 of Fun). For
building $L_1$, all the attributes in $R$ are considered by the algorithm (line 2).
Their cardinality is computed (using the function Count). Their quasi-closure
and closure are initialized to the candidate. The set $R'$ is then built (line 3).
It includes every attribute of the relation schema $R$ except single attributes
which are keys. Then each iteration of the loop (lines 4-9) deals with a level by
computing the closure of the free sets discovered at the previous level (i.e.
elements of $L_{k-1}$) and, from this result, by computing the quasi-closure of
the candidates at the current level. At this point FDs, if any, having a source
of $(k - 1)$ attributes are extracted and displayed. Then the stages of pruning
and candidate generation apply. When the iterations complete, the FDs captured
in the last examined candidates are yielded (line 10).

Algorithm FUN
1   L_0 := < ∅, 0, ∅, ∅ >
2   L_1 := { < A, Count( A ), A, A > | A ∈ R }
3   R' := R − { A | A is a key }
4   for ( k := 1; L_k ≠ ∅; k := k + 1 ) do
5       ComputeClosure( L_{k-1}, L_k )
6       ComputeQuasiClosure( L_k, L_{k-1} )
7       DisplayFD( L_{k-1} )
8       PurePrune( L_k, L_{k-1} )
9       L_{k+1} := GenerateCandidate( L_k )
10  DisplayFD( L_{k-1} )
end FUN

Apart from the procedure DisplayFD, which simply yields the minimal FDs
discovered at each level, the functions and procedures used in Fun are described
below and their pseudo-code is given.

From the set of free sets in $L_k$, GenerateCandidate yields a new set of
candidates encompassing $k + 1$ attributes, supersets of keys excluded (line 1),
following the principles of apriori-gen [2] (recalled in Sect. 3.1).

Function GenerateCandidate( in L_k )
1   L_{k+1} := { l | ∀ l' ⊂ l, |l| = |l'| + 1, l' ∈ L_k, and |l'|_r ≠ |r| }
2   for each l ∈ L_{k+1} do
3       l.count := Count( l.candidate )
4   return L_{k+1}
end GenerateCandidate
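
The candidate generation step can be sketched in Python as follows. This is a
minimal illustration in the spirit of line 1 of GenerateCandidate, not the
authors' implementation: the function name, the representation of attribute sets
as `frozenset` objects and the `keys` parameter are assumptions, and the counting
step (lines 2-3) is left out.

```python
from itertools import combinations


def generate_candidates(free_sets, keys=frozenset()):
    """Build candidates of size k+1 from the free sets of size k.

    free_sets: collection of frozensets, all of the same size k.
    keys: the free sets whose cardinality equals |r|; their supersets are skipped.
    """
    usable = {s for s in free_sets if s not in keys}
    candidates = set()
    for left, right in combinations(usable, 2):
        union = left | right
        # Keep unions that add exactly one attribute and whose maximal
        # subsets are all usable free sets of the previous level.
        if len(union) == len(left) + 1 and all(union - {a} in usable for a in union):
            candidates.add(union)
    return candidates
```

For instance, with the free sets {A, B} and {B, C} only, no candidate of length 3
is produced, because the maximal subset {A, C} is not a free set.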
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ComputeClosure applies to the set of free sets achieved at the previous level 
and builds the closure of each free set $l$ in the following way. Initially set to
the quasi-closure of $l$ (line 3), the closure is complemented by adding the
attributes determined by $l$. The procedure assesses whether $l \to A$ holds by
comparing the cardinalities of $l$ and $l \cup \{A\}$ (line 5). Such a comparison
makes use of the function FastCount (described below).

Procedure ComputeClosure( inout L_{k-1}, in L_k )
1   for each l ∈ L_{k-1} do
2       if l is not a key then
3           l.closure := l.quasiclosure
4           for each A ∈ R' − l.quasiclosure do
5               if FastCount( L_{k-1}, L_k, l.candidate ∪ {A} ) = l.count
6                   then l.closure := l.closure ∪ {A}
end ComputeClosure

The procedure ComputeQuasiClosure builds the quasi-closure of each combination
in the current set of candidates. Initialized to the candidate (line 2), the
quasi-closure of any candidate $l$ is obtained (line 4) by computing the union of
the closures of its maximal subsets (according to Definition 3). If the candidate
under consideration is a key, its closure is also computed (line 5).

Procedure ComputeQuasiClosure( inout L_k, in L_{k-1} )
1   for each l ∈ L_k do
2       l.quasiclosure := l.candidate
3       for each s ⊂ l.candidate and s ∈ L_{k-1} do
4           l.quasiclosure := l.quasiclosure ∪ s.closure
5       if l is a key then l.closure := R
end ComputeQuasiClosure

The procedure PurePrune is intended for eliminating non-free sets from $L_k$.
The decision to discard a candidate is simply based on the comparison of the
cardinalities of the candidate and of its maximal subsets which are free sets
(line 3).

Procedure PurePrune( inout L_k, in L_{k-1} )
1   for each l ∈ L_k do
2       for each s ⊂ l.candidate and s ∈ L_{k-1} do
3           if l.count = s.count then Delete l from L_k
end PurePrune

The function FastCount efficiently yields the cardinality of a candidate $l$ by
either accessing its counting value or returning the maximal counting value of
its subsets (cf. Lemma 3).

Function FastCount( in L_{k-1}, in L_k, in l.candidate )
1   if l.candidate ∈ L_k then return l.count
2   return Max( l'.count | l'.candidate ⊂ l.candidate, l'.candidate ∈ L_{k-1} )
end FastCount
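
The counting inference of Lemma 3, as used by FastCount, can be sketched in
Python as follows. The sketch assumes, purely for illustration, that each level
is stored as a dictionary mapping an attribute set (a `frozenset`) to its count;
this representation and the function name are not taken from the paper.

```python
def fast_count(prev_level, cur_level, candidate):
    """Return the cardinality of candidate without touching the data.

    prev_level: dict {frozenset: count} for the free sets of level k-1.
    cur_level:  dict {frozenset: count} for the candidates of level k.
    If the candidate was counted at the current level, its count is returned;
    otherwise it is a non-free set whose count is inferred (Lemma 3) as the
    maximal count of its subsets found at the previous level.
    """
    candidate = frozenset(candidate)
    if candidate in cur_level:
        return cur_level[candidate]
    # Raises ValueError if no subset of the candidate is stored in prev_level.
    return max(count for attrs, count in prev_level.items() if attrs < candidate)
```

The second branch is what avoids a counting pass for attribute sets that were
never generated as candidates because one of their subsets is not a free set.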
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Like in all other level-wise algorithms in data mining, the pruning rules are 
enforced in Fun for minimizing the number of candidates to be verified (without 
calling PurePrune, Fun would yield the very same cover). Another optimization 
technique used when implementing Fun is dealing with stripped partitions (thus 
a single pass over initial data is necessary) [12,18]. Stripped partitions provide 
an optimized representation of the original relation only requiring to preserve 
tuple identifiers, thus handled data is incomparably less voluminous. 



3.3 Running Example 

To illustrate how Fun runs, we show how the algorithm unfolds on our relation
example. In Fig. 2, attributes are denoted by their initial, except CAP, which
is denoted by Ca. For the various levels, candidates (column X), their
counting value (count), their quasi-closure ($X^{\diamond}$) and closure ($X^{+}$)
are given.
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Fig. 2. Running example for the relation LECT 

At level 1, all the attributes are considered, and their cardinality is computed.
Two keys are discovered ($|P|$ = 7; $|C|$ = 7) and thus discarded from further
search. The quasi-closure of the handled attributes is reduced to single
attributes. Then the candidate generation is performed and all possible pairs of
remaining attributes are built. Provided with these new candidates, level 2
begins by computing the closure of the free sets of the previous level, and then
computes the quasi-closure of the current candidates. Combinations capturing FDs
are deleted (simply struck out in the figure), and of course they are not used
for generating candidates for level 3. Among the latter, two keys (DBR, DER) are
discovered; since only a single combination remains, no new candidate can be
generated and the algorithm completes. The extracted minimal FDs are given by
using the union of their targets: P $\to$ CDBERCa; C $\to$ PDBERCa; E $\to$ BCa;
R $\to$ Ca; BR $\to$ E; BCa $\to$ E; DBR $\to$ PC; DER $\to$ PC.

4 Discovering Embedded Dependencies 

Given a set of FDs holding in the relation $r$, embedded dependencies are FDs
valid in the projection of $r$ over a subset of its attributes. Embedded
dependencies capture interesting knowledge for the database
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administrator, particularly relevant when the number of attributes is large in a 
relation. In such a situation, or when the number of extracted FDs is great, the 
administrator can focus on particular attribute subsets, and the results of FD 
discovery can best be understood and more easily elucidated. Moreover, when 
handling materialized views, embedded dependencies exactly capture the FDs 
valid in the views. 

Given a set $\mathcal{F}$ of valid FDs in the relation $r$, embedded dependencies
(also called projection of FDs [13,15]) are the FDs holding for an attribute
subset $X$ of $R$. In [15,30], the set of embedded FDs is defined as follows:

$\mathcal{F}[X] = \{Y \to Z \;/\; \mathcal{F} \models Y \to Z \wedge YZ \subseteq X\}$.

Computing $\mathcal{F}[X]$ cannot be achieved by simply discarding from
$\mathcal{F}$ the dependencies having a source or a target not included in $X$
because, due to FD properties [3], new dependencies could be introduced in
$\mathcal{F}[X]$, as illustrated in the following example. In fact the problem is
computationally hard [30] and computing FD projection can be exponential [13].

Example 3. Let us consider the relation schema $R(A_1, \ldots, A_n, B, C)$ and
suppose that $\mathcal{F}$ encompasses the following FDs:

$\{A_1 \to B, A_2 \to B, \ldots, A_n \to B, B \to C, C \to A_1, \ldots, C \to A_n\}$.

Then $\mathcal{F}[A_1, \ldots, A_n, C] = \{A_1 \to C, \ldots, A_n \to C, C \to A_1, \ldots, C \to A_n\}$. □

Complementing formal definitions provided in Sect. 2, minimal embedded FDs 
can be characterized by using the simple definition of Embedded Free Sets. 

Definition 4. Embedded Free Sets

Let $X \subseteq R$ be an attribute set. The set of all free sets included in $X$
is denoted by $\mathcal{FS}_X$ and defined by:
$\mathcal{FS}_X = \{X' \;/\; X' \in \mathcal{FS}_r, X' \subseteq X\}$.

Lemma 4. The set of embedded minimal FDs in $X$ is:

$\mathcal{F}_m[X] = \{X' \to A \;/\; X' \in \mathcal{FS}_X, A \in ({X'}_r^{+} - {X'}_r^{\diamond}) \cap X\}$.

Proof. It is obvious that $X' \to A$ is a minimal FD (cf. Theorem 1) captured in
$X$ (because $X$ embodies both $X'$ and $A$). Since $\mathcal{FS}_X$ encompasses
any free set included in $X$, there cannot exist a minimal embedded FD not
included in $\mathcal{F}_m[X]$.

In the current implementation of Fun, free sets of whatever size, their closure
and quasi-closure are preserved in main memory because the allocated memory is
small. Under such conditions, computing $\mathcal{F}_m[X]$ is straightforward.
Once a free set $X'$ belonging to $\mathcal{FS}_X$ is discovered, it is simply
necessary to verify whether the attributes in ${X'}_r^{+} - {X'}_r^{\diamond}$
are included in $X$.
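
The filtering described by Lemma 4 can be sketched in Python as follows. The
sketch assumes, for illustration only, that the free sets retained by Fun are
available as a dictionary mapping each free set to its (closure, quasi-closure)
pair; the data `FREE_SETS` below is a hypothetical example corresponding to a
relation over attributes A, B, C in which A determines B and B determines C.

```python
# Hypothetical output of the discovery step: free set -> (closure, quasi-closure).
FREE_SETS = {
    frozenset("A"): (frozenset("ABC"), frozenset("A")),
    frozenset("B"): (frozenset("BC"), frozenset("B")),
    frozenset("C"): (frozenset("C"), frozenset("C")),
}


def embedded_min_fds(free_sets, subset):
    """Minimal FDs embedded in subset, following the form of Lemma 4.

    Only free sets included in subset are kept, and only targets that belong
    both to (closure - quasi-closure) and to subset.
    """
    subset = frozenset(subset)
    return {
        (source, target)
        for source, (clo, quasi) in free_sets.items()
        if source <= subset
        for target in (clo - quasi) & subset
    }
```

Projecting on {A, C} yields the dependency from A to C even though B is
projected out, because the closure of the free set {A} already contains C.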

5 Experimental Results 

In order to assess the performance of Fun, the algorithm was implemented using
the language C++. An executable file can be generated with Visual C++ 5.0 or
GNU g++ compilers. Experiments were performed on an AMD K6-2/400 MHz
with 256 MB, running Windows NT 4.0. For the sake of an accurate comparison,
a new version of Tane was implemented under the very same conditions. Both
programs are available at [14]. The benchmark relations used for experiments
are synthetic data sets automatically generated, using the following parameters:
$|r|$ is the cardinality of the relation, $|R|$ stands for its number of
attributes and $c$ is the rate of data correlation. The more $c$ increases, the
more chances there are to have FDs captured within the data.

Note on the memory usage of Fun: Experiments performed for a relation
encompassing 100,000 tuples and 50 attributes (with a correlation rate of 70 %)
show that only 24.9 MB are required for both partitions and Fun intermediate
results.

Figure 3 details, for various values of the two latter parameters, execution times
(in seconds) for Fun and Tane, setting the tuple number to 100,000. Empty cells
in the table mean that execution exceeded three hours and was interrupted.
As expected, Fun outstrips Tane in all cases. More precisely, when relations
have few attributes and data is weakly correlated, results are almost comparable.
However, the gap between the execution times of Fun and Tane steadily increases
when the relation attribute set is enlarged and/or the correlation rate grows,
until a threshold beyond which the performance of Tane degrades sharply while
Fun remains efficient. For weakly and strongly correlated data, the curves in
Figs. 4 and 5 illustrate the execution times of the two algorithms when
increasing the number of attributes for various values of $c$. These curves
highlight the aforementioned threshold on the number of attributes. When $c$ is
set to 30 %, the performance of Tane clearly degrades beyond 50 attributes (cf.
Fig. 4 A). When the correlation rate increases, the threshold decreases: down to
40 or 30 attributes for $c$ = 50 % or 70 % (cf. Figs. 4 B and 5 A). The curves in
Fig. 5 B give the execution times of Tane and Fun when varying the correlation
rate for a relation with 100,000 tuples and 40 attributes. The gap between the
execution times of the two algorithms widens strongly when data is highly
correlated. Given a number of attributes and a correlation rate, the execution
times of Fun and Tane only increase linearly in the number of tuples.

Two reasons explain the better performance of Fun when compared with Tane:
the fast computation of FD targets and a more efficient pruning (as briefly
explained in Sect. 7).
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Fig. 3. Execution times in seconds for correlated data (|r| = 100,000) 



Other experiment results are available at the URL given in [14]. 
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Fig. 4. Execution times in seconds for correlated data 
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Fig. 5. Execution times in seconds for correlated data and for various correlation rates 



6 Related Work 

As suggested in Sect. 1 and in [31], the problem of discovering FDs is close
to other data mining issues, in particular the discovery of association rules or
sequential patterns [1,2,33]. Nevertheless, tangible differences appear (see the
note below). In this section, we only focus on related work addressing FD
discovery.

Note on these differences: Algorithms mining association rules deal with
itemsets. For the sake of comparison, they could be seen as managing values of
binary attributes [33], but when adopting this vision, the number of attributes
is extremely large, and an optimized data representation based on partitions
could not apply. The efficiency problem is much more critical when considering
the original data to be mined. For rules, the most interesting raw material is
historical data preserved in data warehouses, whereas FD discovery applies to
operational databases. Moreover the former approaches are interested in
extracting all (maximal) frequent itemsets whereas FD discovery approaches aim
at mining minimal FDs. Finally, counting operations are, in the former case,
performed level by level, requiring at each level a pass over the data. When
dealing with FDs, counting operates attribute by attribute.

The FD inference issue has received a great deal of attention and is still
investigated. Let us quote, among numerous approaches, the following proposals
[8,18,26,28,29,36]. A critical concern when discovering FDs is to propose
algorithms that remain efficient when handling large amounts of data. Earlier
proposals based on repeated sorting of tuples or comparison between tuples could
not meet such a requirement. The various contributions of H. Mannila,
K.-J. Räihä et al. must be underlined because they define the theoretical
framework of the tackled problem, study its complexity, and propose algorithms
for solving it [28,29] while addressing very close issues such as building
Armstrong relations [27], investigating approximate inference in order to
discover an approximate cover of dependencies [21], or studying FD projection
[30]. However, they neither provide nor experiment with operational solutions.
Computing the FD projection is studied in [13], and G. Gottlob proposes an
algorithm called “Reduction By Resolution” (RBR) [15]. Its correctness is shown
in [15,30], and its complexity is polynomial for certain classes of FDs. Various
related problems such as the construction of Armstrong relations, testing key
minimality or normal form, and the study of other dependencies are addressed
[4,5,6,12,16,37], including in the context of incomplete relations [22,23,24].

To our knowledge, the most efficient algorithm proposed until now is
Tane [18,19]. Tane is based on the concept of partition [12,38] which is used as
a formal basis for a new characterization of FDs and as a powerful mechanism for
optimizing the size of managed data. By using partitions and embodied
equivalence classes, Tane characterizes sources of FDs. The associated targets
are computed by using, for each examined source, a set of possible determined
attributes. Initialized to $R$, this set is progressively reduced by discarding
targets once they are discovered. The approach adopts principles of level-wise
algorithms for exploring the search space, and pruning rules are enforced.
Experimental results show the efficiency of the proposed algorithm and its good
performance when compared to Fdep [36], chosen for its availability and
efficiency.

In [26], an alternative approach, Dep-Miner, is proposed. Inspired by [27,29], it
is based on the concepts of agree set [6] and maximal set, also called meet
irreducible set [16,25]. For any pair of tuples, the agree set is the set of all
attributes sharing the very same value. The maximal set of any attribute $A$ is
the attribute set of greatest size not determining $A$. The approach fits in a
rigorous framework complementing the proposals of [20,27,29], and introduces a
characterization of minimal FD sources, based on minimal transversals of a
simple hypergraph. The defined algorithm adopts the optimized representation of
partitions, used in Tane, for minimizing the size of handled data. New fast
algorithms are proposed for computing agree sets and minimal transversals of a
hypergraph. Experiments, run on various benchmark databases, show the efficiency
of Dep-Miner (see the note below). Dep-Miner also computes small Armstrong
relations.

Note on Dep-Miner: The comparison between Fun and the original version of
Dep-Miner shows an important gap, in favor of Fun, between execution times. A new
version of Dep-Miner is currently developed in order to perform valid and
accurate comparisons with Fun. The first results confirm the better efficiency
of Fun.

Tane and Dep-Miner illustrate particularly well what data mining contributions
have brought to a problem that has been addressed for many years and is still
topical and important, in particular for database administration tools.



7 Discussion and Further Work

The approach presented in this paper provides both a novel characterization
of minimal FDs, which can be embedded or not, and an operational algorithm,
Fun, which is particularly efficient for mining the canonical cover of FDs in a
relation. Experiments run on very large relations show the efficiency of our
proposal compared with the best current algorithm, Tane. For discovering minimal
embedded dependencies, our solution follows directly from the results of Fun,
whereas the algorithm RBR [15] can be exponential.

In order to complement the experimental results given in Sect. 5, we briefly
compare Tane and our approach. With the concept of partition, Tane provides a
simple characterization of minimal FD sources. Nevertheless, specifying the
targets of such dependencies requires numerous lemmas which take into account
different pruning cases. This results in a definition of the possible target set
for a source (called $C^{+}$) that is difficult to comprehend and relatively
costly to compute. The gap observed between the execution times of Tane and Fun
originates, on the one hand, from the $C^{+}$ computation and, on the other hand,
from the exploration of the search space, which is smaller in Fun than in
Tane [14]. In fact, pruning operations in Tane are intended for discarding
irrelevant candidates, among which certain capture non-minimal FDs, whereas, in
Fun, all examined candidates only capture (by construction) possible minimal
FDs. Any candidate provided with a set $C^{+}$ different from $R$ is a non-free
set. It is deleted by Fun whereas Tane eliminates a candidate only if its
associated $C^{+}$ is empty. Thus the search space explored by Fun is smaller
than the one explored by Tane.

The new characterization that we propose is simpler and sound. It is based on
three simple concepts. In our approach, both the characterization, which is
proved to be correct, and the implementation, which was successfully tested, are
clear-cut.

Following from the ideas introduced in [9], which aim to identify a “common data
centric step” for a broad class of data mining algorithms, we would like to do
likewise for database analysis. Actually, based on the discovery of free sets
with Fun, various related problems, such as the generation of Armstrong
relations, minimal key inference, and 3NF and BCNF tests, could be solved in a
simple and efficient way (i.e. without significant additional time).

Acknowledgment. We would like to thank Lotfi Lakhal for his constructive
comments on the paper.
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Abstract. In data mining association rules are very popular. Most of 
the algorithms in the literature for finding association rules start by
searching for frequent itemsets. The itemset mining algorithms typically
interleave brute force counting of frequencies with a meta-phase for pruning
parts of the search space. The knowledge acquired in the counting
phases can be represented by frequent set expressions. A frequent set
expression is a pair containing an itemset and a frequency indicating that
the frequency of that itemset is greater than or equal to the given
frequency. A system of frequent sets is a collection of such expressions. We
give an axiomatization for these systems. This axiomatization characterizes
complete systems. A system is complete when it explicitly contains
all information that it logically implies. Every system of frequent sets
has a unique completion. The completion of a system actually represents
the knowledge that maximally can be derived in the meta-phase.



1 Introduction 

Association rules are one of the most studied topics in data mining. They have
many applications [1]. Since their introduction, many algorithms have been
proposed to find association rules [1] [2] [8].

We start with a formal definition of the association rule mining problem as
stated in [1]: Let $\mathcal{I} = \{I_1, \ldots, I_m\}$ be a set of symbols,
called items. Let $\mathcal{D}$ be a set of transactions, where each transaction
$T$ is a set of items, $T \subseteq \mathcal{I}$, and a unique transaction ID.
We say that a transaction $T$ contains $X$, a set of some items in
$\mathcal{I}$, if $X \subseteq T$. The fraction of transactions containing $X$
is called the frequency of $X$. An association rule is an implication of the
form $X \Rightarrow Y$, where $X \subseteq \mathcal{I}$, $Y \subseteq \mathcal{I}$,
and $X \cap Y = \emptyset$. The rule holds in the transaction set $\mathcal{D}$
with confidence $c$ if the fraction of the transactions containing $X$ that also
contain $Y$ is at least $c$. The rule $X \Rightarrow Y$ has support $s$ in the
transaction set $\mathcal{D}$ if the fraction of the transactions in
$\mathcal{D}$ that contain $X \cup Y$ is at least $s$.
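
As a small illustration of these definitions, the following Python sketch
computes the frequency of an itemset and the confidence of a rule over a list of
transactions. The function names and the representation of transactions as
Python sets are assumptions made for the example.

```python
def frequency(transactions, itemset):
    """Fraction of the transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs.

    Assumes that at least one transaction contains lhs (otherwise the
    division below fails).
    """
    both = frozenset(lhs) | frozenset(rhs)
    return frequency(transactions, both) / frequency(transactions, lhs)
```

With these helpers, a rule from `lhs` to `rhs` has support `s` when
`frequency(transactions, lhs | rhs) >= s` and holds with confidence `c` when
`confidence(transactions, lhs, rhs) >= c`.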

Most algorithms start with searching itemsets that are contained in at least 
a fraction s of the transactions. To optimize the search for frequent itemsets, the 
algorithms use the following monotonicity principle: 
if $X \subseteq Y$, then the frequency of $X$ will never be smaller than the
frequency of $Y$.

This information is then used to prune parts of the search space a priori. To
exploit this monotonicity as much as possible, the apriori-algorithm [2] starts
by counting the single itemsets. In the second step, only itemsets
$\{i_1, i_2\}$ are counted where $\{i_1\}$ and $\{i_2\}$ are frequent. All other
2-itemsets are discarded. In the third step, the algorithm proceeds with the
3-itemsets that only contain frequent 2-itemsets. This iteration continues until
no itemsets that can be frequent are left. The search for frequent itemsets is
thus basically an interleaving of a counting phase and a meta-phase. In the
counting phase, the frequencies of some predetermined itemsets, the so-called
candidates, are counted. In the meta-phase the results of the counting phase are
evaluated. Based on the monotonicity principle, some itemsets are a priori
excluded.
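
The interleaving of counting phases and meta-phases can be sketched in Python as
follows. This is a simplified illustration of the level-wise idea described
above, not the apriori-algorithm of [2] as published: the function name and the
naive candidate enumeration are assumptions chosen for brevity.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_freq):
    """Return a dict mapping each frequent itemset to its frequency."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Counting phase: measure the frequency of the current candidates.
        counted = {c: sum(1 for t in transactions if c <= t) / n for c in candidates}
        kept = {c: f for c, f in counted.items() if f >= min_freq}
        frequent.update(kept)
        # Meta-phase: by monotonicity, a (k+1)-itemset can only be frequent
        # if all of its k-subsets are frequent; all others are never counted.
        k += 1
        alive = sorted({i for c in kept for i in c})
        candidates = [
            frozenset(combo)
            for combo in combinations(alive, k)
            if all(frozenset(sub) in kept for sub in combinations(combo, k - 1))
        ]
    return frequent
```

The loop stops as soon as the meta-phase leaves no candidate, which mirrors the
termination condition stated in the text.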

Although the monotonicity of frequency is commonly used, there is to our
knowledge no previous work that discusses whether in the general case this rule
is complete, in the sense that it tells us everything we can derive from a given
set of frequencies. In this paper we consider the notion of a system of frequent
sets. A system of frequent sets contains, possibly incomplete, information about
the frequency of every itemset. For example, $A :: 0.6$, $B :: 0.6$,
$AB :: 0.1$, $\emptyset :: 0.5$ is a system of frequent sets. This system of
frequent sets represents partial information (e.g. obtained in counting phases).
In this system, $A :: 0.6$ expresses the knowledge that itemset $A$ has a
frequency of at least 0.6. The system can be improved. Indeed, we can conclude
that $AB :: 0.2$ holds, since $A :: 0.6$ and $B :: 0.6$ and there must be an
overlap of at least a 0.2-fraction between the transactions containing $A$ and
the transactions containing $B$. We can also improve $\emptyset :: 0.5$, because
$\emptyset :: 1$ always holds. Therefore, this system is called incomplete.
When a system cannot be improved, it is complete. The completion of a system
represents the maximal information that can be assumed in the meta-phase.
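
The overlap argument used in this example can be spelled out as a short
derivation. The notation below is introduced only for this illustration and is
not the paper's: $T_A$ denotes the set of transactions of a database
$\mathcal{D}$ that contain $A$, and $\mathrm{freq}$ denotes the true frequency.

```latex
\begin{align*}
\mathrm{freq}(AB)
  &= \frac{|T_A \cap T_B|}{|\mathcal{D}|}
   = \frac{|T_A| + |T_B| - |T_A \cup T_B|}{|\mathcal{D}|} \\
  &\geq \mathrm{freq}(A) + \mathrm{freq}(B) - 1
   \geq 0.6 + 0.6 - 1 = 0.2 .
\end{align*}
```

The first inequality uses $|T_A \cup T_B| \leq |\mathcal{D}|$, and the second
uses the lower bounds $A :: 0.6$ and $B :: 0.6$ of the example system.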

We give three rules F1, F2, and F3 that characterize complete systems of
frequent sets; i.e. a system is complete iff it satisfies F1, F2, and F3. We
show that, after a small modification to F3, this axiomatization is finite and
every logical implication can be inferred using these axioms a finite number of
times.

As an intermediate stage in the proofs, we introduce rare sets. A rare set
expression $K : p_K$ expresses that at most a $p_K$-fraction of the transactions
does not contain at least one item of $K$.

The structure of the paper is as follows: in Section 2 related work is
discussed. In Section 3 we formally define a system of frequent sets. In
Section 4, an axiomatization for complete systems of frequent sets is given.
Section 5 discusses inference of complete systems using the axioms. Section 6
summarizes and concludes the paper.

Many proofs in this paper are only sketched. The full proofs can be found in
[3].
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2 Related Work 

In the artificial intelligence literature, probabilistic logic is studied
intensively. The link with this paper is that the frequency of an itemset $I$
can be seen as the probability that a randomly chosen transaction from the
transaction database satisfies $I$; i.e. we can consider the transaction
database as an underlying probability structure.

Nilsson introduced in [12] the following probabilistic logic problem: given a
finite set of $m$ logical sentences $S_1, \ldots, S_m$ defined on a set
$X = \{x_1, \ldots, x_n\}$ of $n$ boolean variables with the usual boolean
operators $\wedge$, $\vee$, and $\neg$, together with probabilities
$p_1, \ldots, p_m$, does there exist a probability distribution on the
possible truth assignments of $X$, such that the probability of $S_i$ being
true is exactly $p_i$ for all $1 \leq i \leq m$? Georgakopoulos et al. prove in
[7] that this problem, for which they suggest the name probabilistic
satisfiability problem (PSAT), is NP-complete. This problem, however, does not
apply to our framework. In our framework, a system of frequent sets can always
be satisfied. Indeed, since a system only gives lower bounds on the frequencies,
the system is always satisfied by a transaction database where each transaction
contains every item.

Another, more interesting problem, also stated by Nilsson in [12], is that
of probabilistic entailment. Again a set of logical sentences
$S_1, \ldots, S_m$, together with probabilities $p_1, \ldots, p_m$, is given,
and one extra logical sentence $S_{m+1}$, the target. It is asked to find the
best possible upper and lower bounds on the probability that $S_{m+1}$ is true,
given that $S_1, \ldots, S_m$ are satisfied with respective probabilities
$p_1, \ldots, p_m$. The interval defined by these lower and upper bounds forms
the so-called tight entailment of $S_{m+1}$. It is well known that both PSAT and
probabilistic entailment can be solved nondeterministically in polynomial time
using linear programming techniques. In our framework, a complete system of
frequent sets is a system that only contains tight frequent expressions; i.e.
the bounds of the frequent expressions in the complete system are the best
possible in view of the system, and as such, this corresponds to the notion of
tight entailment.

For a comprehensive overview of probabilistic logic, entailment and various
extensions, we refer to [9,10]. Nilsson’s probabilistic logic and entailment have
been extended in various ways, including assigning intervals to logical
expressions instead of exact probability values and considering conditional
probabilities [6].

In [4], Fagin et al. study the following extension. A basic weight formula is an
expression $a_1 w(\varphi_1) + \ldots + a_k w(\varphi_k) \ge c$, where
$a_1, \ldots, a_k$ and c are integers and $\varphi_1, \ldots, \varphi_k$ are
propositional formulas, meaning that the sum of all $a_i$ times the weight of
$\varphi_i$ is greater than or equal to c. A weight formula is a boolean
combination of basic weight formulas. The semantics are introduced by an
underlying probability space. The weight of a formula corresponds with the
probability that it is true. The main contribution (from the viewpoint of our
paper) of [4] is the description of a sound and complete axiomatization for this
probabilistic logic. The logical framework in our paper is in some sense
embedded into the logic in [4]. Indeed, if we introduce a propositional symbol
$P_i$ for each item i, the frequent set expression $K :: p_K$ can be translated
as $w(\bigwedge_{i \in K} P_i) \ge p_K$. As such, by results obtained in [4], the
implication problem in our framework is guaranteed to be decidable.
Satisfiability, and thus also the implication problem, are NP-complete in
Fagin’s framework. Our approach differs from Fagin’s approach in the sense that
we only consider situations where for all expressions a probability is given.

Also in [6], axioms for a probabilistic logic are introduced. However, the
authors are unable to prove whether the axioms are complete. For a sub-language
(Type-A problems), they prove that their set of axioms is complete. However, this
sub-language is not sufficiently powerful to express frequent itemset expressions.

On the other side of the spectrum, we have related work within the context
of data mining. There have been attempts to prove some completeness results
for itemsets in this area. One such attempt is described briefly in [11]. In the
presence of constraints on the allowable itemsets, the authors introduce the
notion of ccc-optimality^1. ccc-optimality can intuitively be understood as “the
algorithm only generates and tests itemsets that still can be frequent, using the
current knowledge.” Our approach, however, is more general, since we do not
restrict ourselves to a particular algorithm. We know of no attempt in the
context of data mining that studies what can be derived from an arbitrary set
of frequent itemsets.

Finally, we would like to add that in our paper the emphasis is on intro- 
ducing a logical framework for frequent itemsets and not on introducing a new 
probabilistic logic, nor on algorithms. 



3 Complete System of Frequent Sets 

We formally define a system of frequent sets. We also define what it means for 
a system to be complete. 

To represent a database with transactions, we use a matrix. The columns 
of the matrix represent the items and the rows represent the transactions. The 
matrix contains a one in the (i, j)-entry if transaction i contains item j, else this 
entry is zero. When R is a matrix where the columns represent the items in I, we
say that R is a matrix over I. In our running example we regularly refer to the
items with capital letters. With this notation, we get the following definition:

Definition 1. Let $I = \{I_1, \ldots, I_n\}$ be a set of items, and R be a matrix
over I. The frequency of an itemset $K \subseteq I$ in R, denoted freq(K, R), is the
fraction of rows in R that have a one in every column of K.

^1 ccc-optimality stands for Constraint Checking and Counting-optimality.
Example 1. In Fig. 1, a matrix is given, together with some frequencies. The
frequency of DEF^2 is 0.2, because 2 rows out of 10 have a one in every column
of DEF. Note that, because R is a matrix, R can have identical rows.



freq(A, R) = 0.7
freq(B, R) = 0.5
freq(AB, R) = 0.3
freq(DEF, R) = 0.2

R satisfies A :: 0.5, AB :: 0.3, DEF :: 0.1
R does not satisfy A :: 0.8, ABC :: 0.4, DEF :: 0.3

[Matrix R: not reproduced]

Fig. 1. A matrix together with some frequent set expressions

We now introduce logical implication and completeness of a system of fre- 
quent sets. 

Definition 2. Let $I = \{I_1, \ldots, I_n\}$ be a set of items.

— A frequent set expression over I is an expression $K :: p_K$ with
$K \subseteq I$ and $p_K$ rational with $0 \le p_K \le 1$.

— A matrix R over I satisfies $K :: p_K$ iff $freq(K, R) \ge p_K$. Hence itemset K
has frequency at least $p_K$.

— A system of frequent sets over I is a collection
$\bigcup_{K \subseteq I} \{K :: p_K\}$ of frequent set expressions, with exactly one
expression for each $K \subseteq I$.

— A matrix R over I satisfies the system $\bigcup_{K \subseteq I} \{K :: p_K\}$ iff R
satisfies all $K :: p_K$.



Example 2. In Fig. 1, the matrix R satisfies A :: 0.6, because the frequency of
A in R is bigger than 0.6. The matrix does not satisfy B :: 0.7, because the
frequency of B is lower than 0.7.
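Definitions 1 and 2 can be made concrete with a small sketch. The matrix `R` below is a hypothetical four-row example over items A, B, C (it is not the matrix of Fig. 1), and the helper names `freq` and `satisfies` are illustrative assumptions rather than notation from the paper.

```python
from fractions import Fraction

ITEMS = ["A", "B", "C"]  # hypothetical item set; gives the column order of R

# Hypothetical 0/1 matrix: one row per transaction, one column per item.
R = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]

def freq(itemset, matrix, items=ITEMS):
    """Fraction of rows that have a one in every column of `itemset`."""
    cols = [items.index(i) for i in itemset]
    hits = sum(1 for row in matrix if all(row[c] == 1 for c in cols))
    return Fraction(hits, len(matrix))

def satisfies(matrix, itemset, p):
    """A matrix satisfies K :: p iff freq(K, matrix) >= p."""
    return freq(itemset, matrix) >= p

print(freq("A", R))                          # 3/4
print(freq("AB", R))                         # 1/2
print(satisfies(R, "AB", Fraction(1, 2)))    # True
print(satisfies(R, "B", Fraction(4, 5)))     # False
```

Exact fractions are used instead of floats so that comparisons such as freq(K, R) >= p are not affected by rounding.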

Definition 3. Let $I = \{I_1, \ldots, I_n\}$ be a set of items, and $K \subseteq I$.

— A system of frequent sets S over I logically implies $K :: p_K$, denoted
$S \models K :: p_K$, iff every matrix that satisfies S, also satisfies
$K :: p_K$. System $S_1$ logically implies system $S_2$, denoted
$S_1 \models S_2$, iff every K :: p in $S_2$ is logically implied by $S_1$.

^2 DEF denotes the set {D, E, F}.
[Figure: the lattice of the system ABC :: 0.4; AB :: 0.6, AC :: 0.4, BC :: 0.6;
A :: 0.6, B :: 0.8, C :: 0.8; ∅ :: 1, with one proof-matrix per itemset. The
matrices are not reproduced.]

Fig. 2. Proof-matrices for a system of frequent sets



— A system of frequent sets $S = \bigcup_{K \subseteq I} \{K :: p_K\}$ is complete
iff for each K :: p logically implied by S, $p \le p_K$ holds.

Example 3. Let I = {A, B, C, D, E, F}. Consider the following system:
$S = \bigcup_{K \subseteq I} \{K :: p_K\}$, where $p_A = 0.7$, $p_B = 0.5$,
$p_{AB} = 0.3$, $p_{DEF} = 0.2$, and $p_K = 0$ for all other itemsets K. The matrix
in Fig. 1 satisfies S. S is not complete, because in every matrix satisfying
DEF :: 0.2, the frequency of DE must be at least 0.2, and S contains DE :: 0.
Furthermore, S does not logically imply EF :: 0.5, since R satisfies S, and R
does not satisfy EF :: 0.5.

Consider the following system over I = {A, B, C}:

{∅ :: 1, A :: 0.6, B :: 0.8, C :: 0.8, AB :: 0.6, AC :: 0.4, BC :: 0.6, ABC :: 0.4}.

This system is complete. We prove this by showing that for every subset K of
I, there exists a matrix $R_K$ that satisfies S, and $freq(K, R_K)$ is exactly
$p_K$. These matrices then prove that for all K, we cannot further improve on K;
i.e. make $p_K$ larger. These proof-matrices are very important in the proof of
the axiomatization that is given in the next section. In Fig. 2, the different
proof-matrices are given.

When a system S is not complete, we can improve this system. Suppose a
system $S = \bigcup_{K \subseteq I} \{K :: p_K\}$ is not complete, then there is a
frequent set expression $K :: p'_K$ that is logically implied by S, and
$p'_K > p_K$. We can improve S by replacing $K :: p_K$ by $K :: p'_K$. The next
proposition says that there exists a unique system C(S) that is logically
implied by S and that is complete.

Proposition 1. Let $I = \{I_1, \ldots, I_n\}$ be a set of items, and
$S = \bigcup_{K \subseteq I} \{K :: p_K\}$ be a system of frequent sets. There exists
a unique system C(S), the completion of S, such that $S \models C(S)$, and C(S)
is a complete system.

Proof. Let $M_K = \{p_K \mid S \models K :: p_K\}$. $M_K$ always contains its
supremum. This can easily be seen as follows: suppose a matrix M satisfies S.
Let p be the frequency of K in M. Since M satisfies S, for all $p_K \in M_K$,
$p \ge p_K$ holds, and hence $p \ge \sup(M_K)$ holds. Hence, every matrix
satisfying S also satisfies $K :: \sup(M_K)$, and thus
$S \models K :: \sup(M_K)$. It is straightforward that the system
$\bigcup_{K \subseteq I} \{K :: \sup(M_K)\}$ is the unique completion of S.
Example 4. Let I = {A, B, C}. The system {∅ :: 1, A :: 0.6, B :: 0.8, C :: 0.8,
AB :: 0.6, AC :: 0.4, BC :: 0.6, ABC :: 0.4} is the unique completion of the system
{∅ :: 0.8, A :: 0.6, B :: 0.8, C :: 0.8, AB :: 0.6, AC :: 0.4, BC :: 0.4, ABC :: 0.4}.
BC :: 0.6 is implied by the second system, since there is an overlap of at least
0.6 between the rows having a one on B and the rows having a one on C.
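The overlap argument can be checked numerically. The sketch below assumes the standard bound freq(BC) >= freq(B) + freq(C) - 1, which holds because the rows missing B and the rows missing C together make up at most (1 - p_B) + (1 - p_C) of the matrix; the function name `pair_lower_bound` is illustrative.

```python
from fractions import Fraction

def pair_lower_bound(p_x, p_y):
    """Lower bound on freq(XY) implied by freq(X) >= p_x and freq(Y) >= p_y."""
    return max(Fraction(0), p_x + p_y - 1)

p_B = Fraction(8, 10)
p_C = Fraction(8, 10)
print(pair_lower_bound(p_B, p_C))  # 3/5, i.e. BC :: 0.6
```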

Remark that when a system is complete, it is not necessary that there exists
one matrix such that for all itemsets the frequency is exactly the frequency given
in the system. Consider for example the following system: {∅ :: 1, A :: 0.5,
B :: 0.5, C :: 0.1, AB :: 0, AC :: 0, BC :: 0, ABC :: 0}. This system is complete.
However, we will never find a matrix in which the following six conditions are
simultaneously true: freq(A) = 0.5, freq(B) = 0.5, freq(C) = 0.1, freq(AB) = 0,
freq(AC) = 0, and freq(BC) = 0, because due to freq(A) = 0.5, freq(B) = 0.5, and
freq(AB) = 0, every row has a one in A or in B. So, every row having a one in C
also has a one in A or in B, and thus violates freq(AC) = 0 or freq(BC) = 0,
respectively.

4 Axiomatizations 

We give an axiomatization for frequent sets. An axiomatization in this context 
is a set of rules that are satisfied by the system if and only if it is complete. In 
order to simplify the notation we first introduce rare sets. In Section 5 we will 
show how we can build finite proofs for all logical implications using the axioms
as rules of inference.



4.1 Rare Sets 

Definition 4. Let $I = \{I_1, \ldots, I_n\}$ be a set of items, and $K \subseteq I$.

— Let R be a matrix over I. The rareness of an itemset $K \subseteq I$ in R, denoted
rare(K, R), is the fraction of rows in R that have a zero in at least one
column of K.

— A rare set expression over I is an expression $K : p_K$ with $K \subseteq I$ and
$p_K$ rational with $0 \le p_K \le 1$.

— A matrix R over I satisfies $K : p_K$ iff $rare(K, R) \le p_K$. Hence itemset K
has rareness at most $p_K$.

— A system of rare sets over I is a collection
$\bigcup_{K \subseteq I} \{K : p_K\}$ of rare set expressions, with exactly one
expression for each $K \subseteq I$.

— A matrix R over I satisfies the system $\bigcup_{K \subseteq I} \{K : p_K\}$ iff R
satisfies all $K : p_K$.

— A system of rare sets S over I logically implies K : p, denoted
$S \models K : p$, iff every matrix that satisfies S also satisfies K : p. System
$S_1$ logically implies system $S_2$, denoted $S_1 \models S_2$, iff every K : p
in $S_2$ is logically implied by $S_1$.

— A system of rare sets $S = \bigcup_{K \subseteq I} \{K : p_K\}$ is complete iff for
each K : p logically implied by S, $p_K \le p$ holds.
Example 5. In Fig. 1, the matrix R satisfies A : 0.4, because the rareness of
A in R is smaller than 0.4. The matrix does not satisfy B : 0.3, because the
rareness of B is greater than 0.3. Let I = {A, B}. The system {AB : 0.8, A : 0.3,
B : 0.4, ∅ : 0.4} is not complete. The unique completion of this system is
{AB : 0.7, A : 0.3, B : 0.4, ∅ : 0}.

The next proposition connects rare sets with frequent sets. The connection 
between the two is straightforward. Indeed: the set of rows that have a zero in 
at least one column of K is exactly the complement of the set of rows having
only ones in these columns. The second part of the proposition shows that an 
axiomatization for rare sets automatically yields an axiomatization for frequent 
sets. 

Proposition 2. Let $I = \{I_1, \ldots, I_n\}$ be a set of items. For every matrix R
over I and every subset K of I holds that

— freq(K, R) + rare(K, R) = 1.

— R satisfies $K : p_K$ iff R satisfies $K :: 1 - p_K$.
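The first part of Proposition 2 can be checked mechanically on any 0/1 matrix. The sketch below uses a hypothetical five-row matrix over two items; `freq`, `rare`, and `COL` are illustrative names.

```python
from fractions import Fraction

# Hypothetical 0/1 matrix over items A and B (columns 0 and 1).
R = [[1, 1], [1, 0], [0, 1], [1, 1], [1, 1]]
COL = {"A": 0, "B": 1}

def freq(itemset, matrix):
    """Fraction of rows with a one in every column of `itemset`."""
    hits = sum(1 for row in matrix if all(row[COL[i]] == 1 for i in itemset))
    return Fraction(hits, len(matrix))

def rare(itemset, matrix):
    """Fraction of rows with a zero in at least one column of `itemset`."""
    hits = sum(1 for row in matrix if any(row[COL[i]] == 0 for i in itemset))
    return Fraction(hits, len(matrix))

# freq and rare count complementary sets of rows, so they sum to one.
for K in ["", "A", "B", "AB"]:
    assert freq(K, R) + rare(K, R) == 1

print(rare("AB", R))  # 2/5
```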

In the following subsection we prove an axiomatization for complete systems 
of rare sets. From this axiomatization, we can easily derive an axiomatization 
for frequent sets, using the last proposition. 

4.2 Axiomatization of Rare Sets 

Before we give the axiomatization, we first introduce our notation of bags. 

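Definition 5.

— A bag over a set S is a total function from S into {0, 1, 2, ...}.

— Let $\mathcal{K}$ be a bag over S and $s \in S$. We say that s appears n times
in $\mathcal{K}$ iff $\mathcal{K}(s) = n$.

— If $\mathcal{K}$ and $\mathcal{L}$ are bags over S, then we define the bag-union
of $\mathcal{K}$ and $\mathcal{L}$, notation $\mathcal{K} \uplus \mathcal{L}$, as
follows: for all $s \in S$,
$(\mathcal{K} \uplus \mathcal{L})(s) = \mathcal{K}(s) + \mathcal{L}(s)$.

— Let $S = \{s_1, s_2, \ldots, s_n\}$. $\{\!\{ c_1's_1, \ldots, c_n's_n \}\!\}$
denotes the bag over S in which $s_i$ appears $c_i$ times for $1 \le i \le n$.

— Let S be a set, $\mathcal{K}$ a bag over S. $\sum_{s \in S} \mathcal{K}(s)$ is the
cardinality of $\mathcal{K}$, and is denoted by $|\mathcal{K}|$.

— Let $\mathcal{K}$ be a bag over the subsets of a set S. Then
$\biguplus \mathcal{K}$ denotes the bag $\biguplus_{K \in \mathcal{K}} K$. The
degree of an element $s \in S$ in $\mathcal{K}$, denoted $deg(s, \mathcal{K})$, is
the number of times s appears in $\biguplus \mathcal{K}$.

Example 6. $\mathcal{K} = \{\!\{ 1'\{a,b\}, 2'\{b,c\}, 2'\{b,d\} \}\!\}$ is a bag over
the subsets of {a, b, c, d}.
$\biguplus \mathcal{K} = \{\!\{ 1'a, 5'b, 2'c, 2'd \}\!\}$.
$deg(b, \mathcal{K}) = 5$. $|\mathcal{K}| = 5$.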
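The bag notions of Definition 5 map directly onto multisets. The sketch below is one possible encoding, assuming Python's `collections.Counter` as the bag type and `frozenset` for the member sets; the function names are illustrative. It recomputes the values of Example 6.

```python
from collections import Counter

# A bag over the subsets of {a, b, c, d}: member set -> multiplicity.
K = Counter({
    frozenset("ab"): 1,
    frozenset("bc"): 2,
    frozenset("bd"): 2,
})

def bag_union_of_members(bag):
    """Bag-union of all member sets of `bag`, counted with multiplicity."""
    total = Counter()
    for subset, count in bag.items():
        for element in subset:
            total[element] += count
    return total

def deg(element, bag):
    """Number of times `element` appears in the bag-union of `bag`."""
    return bag_union_of_members(bag)[element]

def cardinality(bag):
    """Sum of all multiplicities in the bag."""
    return sum(bag.values())

print(sorted(bag_union_of_members(K).items()))
# [('a', 1), ('b', 5), ('c', 2), ('d', 2)]
print(deg("b", K))       # 5
print(cardinality(K))    # 5
```

For two bags with non-negative counts, `Counter` addition (`K1 + K2`) adds the multiplicities and so plays the role of the bag-union of two bags.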

The next three rules form an axiomatization for complete systems of rare sets
in the sense that the complete systems are exactly the ones that satisfy these
three rules. The $p_K$s that appear in the rules indicate the rareness values
given in the system for the set K; i.e. $K : p_K$ is in the system.

R1 $p_\emptyset = 0$

R2 If $K_2 \subseteq K_1$, then $p_{K_2} \le p_{K_1}$

R3 Let $K \subseteq I$, $\mathcal{M}$ a bag of subsets of K. Then

$$p_K \le \frac{\sum_{M \in \mathcal{M}} p_M}{k},$$

with $k = \min_{a \in K}(deg(a, \mathcal{M}))$.^3

The next theorem is one of the most important results of this paper. The 
following lemma, proved in [3], will be used in the proof of the theorem. 

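Lemma 1. Given a set of indices I and given rational numbers $a_K$, $b_K$ for
every non-empty $K \subseteq I$. Consider the following system of inequalities:

$$a_K \le \sum_{i \in K} x_i \le b_K$$

This system has a solution $(x_1, \ldots, x_{\#I})$, $x_i$ rational, iff for all
$\mathcal{K}$ and $\mathcal{L}$, bags of subsets of I with
$\biguplus \mathcal{K} = \biguplus \mathcal{L}$, holds that
$\sum_{K \in \mathcal{K}} a_K \le \sum_{L \in \mathcal{L}} b_L$.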

Theorem 1. Let $S = \bigcup_{K \subseteq I} \{K : p_K\}$ be a system of rare sets
over I. The following two statements are equivalent:

— S is a complete system.

— S satisfies R1, R2, and R3.

Proof. ($\Rightarrow$) R1 and R2 are trivial.

R3: Let $\mathcal{M}$ be a bag over the subsets of an itemset K, and
$S = \bigcup_{K \subseteq I} \{K : p_K\}$ a complete system. Let R be an arbitrary
matrix that satisfies S. For an itemset L, $D_L$ is the bag that contains exactly
those rows r of R for which there exists a k in L such that r(k) = 0. Then, for
every L holds: $|D_L| / |R| \le p_L$. If $r \in D_K$, then there exists an
$a \in K$ such that r(a) = 0. a appears in at least
$k = \min_{a \in K} deg(a, \mathcal{M})$ of the sets of $\mathcal{M}$. Thus,
$k |D_K| \le \sum_{M \in \mathcal{M}} |D_M|$. We can conclude that in every matrix
satisfying S,

$$rare(K, R) = \frac{|D_K|}{|R|} \le \frac{\sum_{M \in \mathcal{M}} |D_M|}{k |R|} \le \frac{\sum_{M \in \mathcal{M}} p_M}{k}.$$

Since S is complete, $p_K \le \frac{\sum_{M \in \mathcal{M}} p_M}{k}$ follows.

($\Leftarrow$) We show that if $S = \bigcup_{K \subseteq I} \{K : p_K\}$ satisfies R1,
R2, and R3, we can for each itemset K find a proof-matrix $R_K$, such that $R_K$
satisfies S, and $rare(K, R_K) = p_K$.^4 We specify $R_K$ by giving the frequency
of every possible row r. $\beta_Z$ denotes the fraction of rows that have a zero
in every column of Z, and a one elsewhere. We will show that there exists such a
matrix $R_K$ with only rows with at most one zero, and this zero, if present,
must be in a column of K; i.e. whenever $|Z| > 1$ or $Z \not\subseteq K$,
$\beta_Z = 0$.




^3 If k = 0, R3 should be interpreted as “$p_K \le 1$”.

^4 Remark the similarities with the traditional Armstrong-relations in functional
dependency theory [5].
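This can be expressed by the following system of inequalities:

$$\begin{cases}
\forall a \in K : 0 \le \beta_a \le 1 & (1)\ \text{all fractions are between 0 and 1} \\
0 \le \beta_0 \le 1 & (2)\ \text{idem} \\
(\sum_{a \in K} \beta_a) + \beta_0 = 1 & (3)\ \text{the frequencies add up to one} \\
p_K = \sum_{a \in K} \beta_a & (4)\ \text{rareness of } K \text{ is exactly } p_K \\
\forall L \subset K : p_L \ge \sum_{a \in L} \beta_a & (5)\ \text{for other sets } L,\ p_L \ge rare(L, R_K)
\end{cases}$$

Every solution of this system describes a matrix that satisfies S. Only (5) needs
a little more explanation. For an arbitrary itemset L,
$rare(L, R_K) = rare(L \cap K, R_K)$ due to the construction. Because S satisfies
R2, $p_L \ge p_{K \cap L}$. Therefore, it suffices to demand that
$rare(L, R_K) \le p_L$, for all $L \subseteq K$.

The system has a solution if the following (simpler) system has a solution:

$$\forall L \subseteq K : p_K - p_L \le \sum_{a \in K} \beta_a - \sum_{a \in L} \beta_a \le p_{K-L} \qquad (1')$$

1 is ok: choose $L = K - \{a\}$, then
$0 \le^{(R2)} p_K - p_{K-\{a\}} \le \beta_a \le p_{\{a\}} \le 1$

2+3 are ok: let $\beta_0 = 1 - \sum_{a \in K} \beta_a = 1 - p_K$

4 is ok: choose $L = \emptyset$, $p_L = 0$ (R1), and thus
$p_K \le \sum_{a \in K} \beta_a \le p_K$

5 is ok: $p_L - p_K \ge \sum_{a \in L} \beta_a - \sum_{a \in K} \beta_a$, combined with (4).

According to Lemma 1, this last system has a rational solution iff for all bags
$\mathcal{M}$ and $\mathcal{N}$ over the subsets of K, such that
$\biguplus \mathcal{M} = \biguplus \mathcal{N}$,
$\sum_{M \in \mathcal{M}} (p_K - p_{K-M}) \le \sum_{N \in \mathcal{N}} p_N$ holds.

Let $\mathcal{L} = \mathcal{N} \uplus \{\!\{ K - M \mid M \in \mathcal{M} \}\!\}$. Then,
by R3 we have that $\frac{\sum_{L \in \mathcal{L}} p_L}{k} \ge p_K$, with
$k = \min_{a \in K} \#(\{\!\{ N \mid a \in N \wedge N \in \mathcal{N} \}\!\} \uplus \{\!\{ M \mid M \in \mathcal{M} \wedge a \notin M \}\!\})$.
Because $\#\{\!\{ M \mid M \in \mathcal{M} \wedge a \in M \}\!\} = \#\{\!\{ N \mid N \in \mathcal{N} \wedge a \in N \}\!\}$,
$k = \#\mathcal{M}$. We have:
$\sum_{L \in \mathcal{L}} p_L \ge \#\mathcal{M}\, p_K$. Since
$\sum_{L \in \mathcal{L}} p_L = \sum_{N \in \mathcal{N}} p_N + \sum_{M \in \mathcal{M}} p_{K-M}$
and $\#\mathcal{M}\, p_K = \sum_{M \in \mathcal{M}} p_K$,
$\sum_{M \in \mathcal{M}} (p_K - p_{K-M}) \le \sum_{N \in \mathcal{N}} p_N$ holds.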

Example 7. The system {∅ : 0.5, A : 0.5, B : 0.25, C : 0.5, AB : 0, AC : 1, BC : 0,
ABC : 1} is not complete, since ∅ : 0.5 violates R1.

The system {∅ : 0, A : 0.5, B : 0.25, C : 0.5, AB : 0, AC : 1, BC : 0, ABC : 1} is
not complete, since for example AB : 0 and A : 0.5 together violate R2.

The system {∅ : 0, A : 0, B : 0, C : 0, AB : 0, AC : 1, BC : 0, ABC : 1} is not
complete, since A : 0, C : 0, and AC : 1 together violate R3.

The system {∅ : 0, A : 0, B : 0, C : 0, AB : 0, AC : 0, BC : 0, ABC : 0} is
complete, since it satisfies R1, R2, and R3. This system is the unique completion
of all systems in this example.
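The violations in Example 7 can be reproduced with a few lines of code. The sketch below is an illustration only: it represents a system of rare sets as a dictionary from itemsets to rareness values, and the names `make_system`, `check_r1`, `check_r2`, and `check_r3` are assumptions. `check_r3` tests R3 for one given bag; it does not search over all bags.

```python
from collections import Counter
from fractions import Fraction

def make_system(values):
    """Map strings such as 'AB' to frozensets; '' stands for the empty set."""
    return {frozenset(k): Fraction(v) for k, v in values.items()}

def check_r1(p):
    """R1: the empty set has rareness 0."""
    return p[frozenset()] == 0

def check_r2(p):
    """R2: p[K2] <= p[K1] whenever K2 is a subset of K1."""
    return all(p[k2] <= p[k1] for k1 in p for k2 in p if k2 <= k1)

def check_r3(p, K, bag):
    """R3 for a non-empty itemset K and one bag (Counter) of subsets of K."""
    k = min(sum(c for m, c in bag.items() if a in m) for a in K)
    if k == 0:
        return p[K] <= 1  # trivial bound when some item of K is not covered
    total = sum(c * p[m] for m, c in bag.items())
    return p[K] <= total / k

second = make_system({"": "0", "A": "0.5", "B": "0.25", "C": "0.5",
                      "AB": "0", "AC": "1", "BC": "0", "ABC": "1"})
third = make_system({"": "0", "A": "0", "B": "0", "C": "0",
                     "AB": "0", "AC": "1", "BC": "0", "ABC": "1"})
bag = Counter({frozenset("A"): 1, frozenset("C"): 1})

print(check_r2(second))                       # False: A : 0.5 but AB : 0
print(check_r1(third), check_r2(third))       # True True
print(check_r3(third, frozenset("AC"), bag))  # False: AC : 1 > (0 + 0) / 1
```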

4.3 Axiomatization of Frequent Sets 

From Proposition 2, we can now easily derive the following axiomatization for 
frequent sets. 

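F1 $p_\emptyset = 1$

F2 If $K_2 \subseteq K_1$, then $p_{K_2} \ge p_{K_1}$

F3 Let $K \subseteq I$, $\mathcal{M}$ a bag of subsets of K. Then

$$p_K \ge 1 - \frac{\#\mathcal{M} - \sum_{M \in \mathcal{M}} p_M}{k},$$

with $k = \min_{a \in K}(deg(a, \mathcal{M}))$.^5

^5 If k = 0, F3 should be interpreted as “$p_K \ge 0$”.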
Theorem 2. Let $S = \bigcup_{K \subseteq I} \{K :: p_K\}$ be a system of frequent
sets over I. The following two statements are equivalent:

— S is a complete system.

— S satisfies F1, F2, and F3.

5 Inference 

In the rest of the text we continue working with rare sets. The results obtained 
for rare sets can, just like the axiomatization, be carried over to frequent sets. 

In the previous section we introduced and proved an axiomatization for com- 
plete systems of rare and frequent sets. There is however still one problem with 
this axiomatization. R3 states a property that has to be checked for all bags 
over the subsets of K. This number of bags is infinite. In this section we show 
that it suffices to check only a finite number of bags: the minimal multi-covers. 
We show that the number of minimal multi-covers over a set is finite, and that 
they can be computed. 

We also look at the following problem: when an incomplete system is given,
can we compute its completion using the axioms? We show that this is indeed
possible. We use R1, R2, and R3 as inference rules to adjust rareness values
in the system; whenever we detect an inconsistency with one of the rules, we
improve the system. When the rules are applied in a systematic way, this method
leads to a complete system within a finite number of steps.

Actually, the completion of a system of frequent sets can be computed in an
obvious way by using linear programming. Indeed, when we look at the proof
of Theorem 1, we can compute the completion of the system of inequalities by
applying linear programming. For all sets K, we can minimize $p_K$ with respect
to a system of inequalities expressing that the frequencies obey the system of
rare sets. Since the system of inequalities has polynomial size in the number
of frequent itemsets, this algorithm is even polynomial in the size of the
system. However, in association rule mining, it is very common that the number
of itemsets becomes very large and thus the system of inequalities will in
practical situations certainly become prohibitively large. Therefore, solving
the linear programming problem is a theoretical solution, but not a practical
one. Also, as mentioned in [6], an axiomatization has the advantage that it
provides human-readable proofs, and that, when the inference is stopped before
termination, a partial solution is still provided.

5.1 Minimal Multi-covers

Definition 6.

— A k-cover of a set S is a bag $\mathcal{K}$ over the subsets of S such that for
all $s \in S$, $deg(s, \mathcal{K}) = k$.

— A bag $\mathcal{K}$ over the subsets of a set S is a multi-cover of S if there
exists an integer k such that $\mathcal{K}$ is a k-cover of S.

— A k-cover $\mathcal{K}$ of S is minimal if it cannot be decomposed as
$\mathcal{K} = \mathcal{K}_1 \uplus \mathcal{K}_2$, with $\mathcal{K}_1$ and
$\mathcal{K}_2$ respectively $k_1$- and $k_2$-covers of S, $k_1 > 0$ and
$k_2 > 0$.

Example 8. Let K = {A, B, C, D}.
$\{\!\{ 1'AB, 1'BC, 1'CD, 1'AD, 1'ABCD \}\!\}$ is a 3-cover of K. It is not
minimal, because it can be decomposed into the following two multi-covers of K:
$\{\!\{ 1'AB, 1'BC, 1'CD, 1'AD \}\!\}$ (a 2-cover) and $\{\!\{ 1'ABCD \}\!\}$
(a 1-cover).
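Whether a bag is a k-cover is easy to test: all elements of the set must have the same degree. The sketch below, with the illustrative helper `cover_degree`, checks the bags of Example 8 under the assumption that bags are encoded as `collections.Counter` objects over `frozenset` members.

```python
from collections import Counter

def cover_degree(bag, S):
    """Return k if `bag` is a k-cover of S, otherwise None."""
    degrees = {sum(c for m, c in bag.items() if s in m) for s in S}
    return degrees.pop() if len(degrees) == 1 else None

K = frozenset("ABCD")
whole = Counter(map(frozenset, ["AB", "BC", "CD", "AD", "ABCD"]))
ring = Counter(map(frozenset, ["AB", "BC", "CD", "AD"]))
top = Counter([frozenset("ABCD")])

print(cover_degree(whole, K))                       # 3
print(cover_degree(ring, K), cover_degree(top, K))  # 2 1
print(ring + top == whole)                          # True

# The 2-cover can itself be split into two 1-covers.
left = Counter(map(frozenset, ["AB", "CD"]))
right = Counter(map(frozenset, ["BC", "AD"]))
print(cover_degree(left, K), cover_degree(right, K), left + right == ring)
# 1 1 True
```

Testing minimality in general requires searching over all such decompositions; this sketch only verifies decompositions that are supplied by hand.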

The new rule that replaces R3 states that it is not necessary to check all
bags; we only need to check the minimal multi-covers. This gives the following
rule:

R3’ Let $K \subseteq I$, $\mathcal{M}$ a minimal k-cover of K. Then

$$p_K \le \frac{\sum_{M \in \mathcal{M}} p_M}{k}.$$

Theorem 3. Let S be a system of rare sets over I. The following statements
are equivalent:

1. S satisfies R1, R2, and R3.

2. S satisfies R1, R2, and R3’.

Sketch of the proof. (1) The direction R1, R2, R3 implies R1, R2, R3’
is trivial, since every k-cover of K is also a bag over the subsets of K, where the
minimal degree is k.

(2) Suppose the system S satisfies R1 and R2, but violates R3. There exists
a set K and a bag $\mathcal{K}$ over the subsets of K, such that
$p_K > \frac{\sum_{L \in \mathcal{K}} p_L}{k}$, with
$k = \min_{a \in K} deg(a, \mathcal{K})$. Starting from this bag, one can construct
a minimal multi-cover of K that violates R3’. We show this construction with an
example. Suppose $\mathcal{K} = \{\!\{ AB, BC, ABC \}\!\}$. Every element appears
at least 2 times in $\mathcal{K}$. We first construct a multi-cover from
$\mathcal{K}$, by removing elements that appear more than others. In this
example, B appears 3 times, and all other elements appear only 2 times. We
remove B from one of the sets in $\mathcal{K}$, resulting in
$\{\!\{ A, BC, ABC \}\!\}$. The sum over $\mathcal{K}$ did not become larger by
this operation, since S satisfies R2. This multi-cover can be split into two
different minimal multi-covers: $\mathcal{K}_1 = \{\!\{ A, BC \}\!\}$ and
$\mathcal{K}_2 = \{\!\{ ABC \}\!\}$. Because now

$$\frac{\sum_{L \in \mathcal{K}} p_L}{k} = \frac{\sum_{L \in \mathcal{K}_1} p_L + \sum_{L \in \mathcal{K}_2} p_L}{k_1 + k_2},$$

for at least one i, $\frac{\sum_{L \in \mathcal{K}_i} p_L}{k_i}$ is smaller than
$p_K$.

Proposition 3. Let K be a finite set. The number of minimal multi-covers of
K is finite and computable.



The proof can be found in [3]. 
5.2 Computing the Completion of a System with Inference Rules

We prove that by applying R1, R2, and R3’ as rules, we can compute the
completion of any given system of rare sets. Applying for example rule R2 means
that whenever we see a situation $K_1 \subseteq K_2$, and the system states
$K_1 : p_{K_1}$ and $K_2 : p_{K_2}$, and $p_{K_2} < p_{K_1}$, we improve the
system by replacing $K_1 : p_{K_1}$ by $K_1 : p_{K_2}$. It is clear that R1 can
only be applied once; R2 and R3 never create situations in which R1 can be
applied again.

R2 is a top-down operation, in the sense that the rareness values of smaller
sets are adjusted using values of bigger sets. So, for a given system S we can
easily reach a fixpoint for rule R2, by going top-down; we first try to improve
the values of the biggest itemsets, before continuing with the smaller ones.

R3 is a bottom-up operation; values of smaller sets are used to adjust the
values of bigger sets. So, again, for a given system S, we can reach a fixpoint for
rule R3, by applying the rule bottom-up.

A trivial algorithm to compute the completion of a system is the following:
apply R1, and then keep applying R2 and R3 until a fixpoint is reached. Clearly,
the limit of this approach yields a complete system, but it is not clear that a
fixpoint will be reached within a finite number of steps. Moreover, there are
examples of situations in which infinite loops are possible. In Fig. 3, such an
example is given. The completion of the first system clearly has all rareness
values equal to zero, because for every matrix satisfying the system, none of
the rows have a zero in AB, and none have a zero in BC, so there are no zeros at
all in the matrix. When we keep applying the rules as in Fig. 3, we never reach
this fixpoint, since in step 2n, the value for ABC is $(\frac{1}{2})^n$. This is
however not a problem; we show that when we apply the rules R2 and R3 in a
systematic way, we always reach a fixpoint within a finite number of steps. This
systematic approach is illustrated in Fig. 4. We first apply R2 top-down until
we reach a fixpoint for R2, and then we apply R3 bottom-up until we reach a
fixpoint for R3. The general systematic approach is written down in Fig. 5. We
prove that for every system these two meta-steps are all that is needed to reach
the completion.
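The non-terminating behaviour can be simulated directly. The sketch below alternates one particular R3 step and one particular R2 step on the values ABC : 1, AB : 0, AC : 1, BC : 0; this schedule is an assumption chosen to reproduce the halving behaviour described above, not necessarily the exact order of rule applications drawn in Fig. 3.

```python
from fractions import Fraction

# Rareness values of the starting system; only ABC and AC change below.
p = {"AB": Fraction(0), "AC": Fraction(1), "BC": Fraction(0), "ABC": Fraction(1)}

history = []
for step in range(1, 9):
    if step % 2 == 1:
        # R3 with the 2-cover {{AB, AC, BC}} of ABC.
        p["ABC"] = min(p["ABC"], (p["AB"] + p["AC"] + p["BC"]) / 2)
    else:
        # R2: AC is a subset of ABC, so p_AC <= p_ABC.
        p["AC"] = min(p["AC"], p["ABC"])
    history.append(p["ABC"])

# Values of ABC after steps 2, 4, 6, 8.
print([str(v) for v in history[1::2]])  # ['1/2', '1/4', '1/8', '1/16']
```

The value of ABC is halved every two steps and therefore never reaches 0, although 0 is its value in the completion.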

Definition 7. Let I be a set of items, $J \subseteq I$, and
$S = \bigcup_{K \subseteq I} \{K : p_K\}$ a system of rare sets over I. The
projection of S on J, denoted proj(S, J), is the system
$S' = \bigcup_{K \subseteq J} \{K : p_K\}$.

Lemma 2. Let I be a set of items, $J \subseteq I$, and
$S = \bigcup_{K \subseteq I} \{K : p_K\}$ a system of rare sets over I. If S
satisfies R2, then proj(C(S), J) = C(proj(S, J)).

Theorem 4. The algorithm in Fig. 5 computes the completion of the system of 
rare sets S. 

Sketch of the proof. Let I = {A, B, C}, and S be a system of rare sets over
I. After the top-down step, the resulting system satisfies R2. First we apply
R3 to adjust the value of A. Because S satisfies R2, and after application of
R3 on A, the system {∅ : 0, A : $p_A$} is complete, we cannot further improve
on A; proj(C(S), {A}) = C(proj(S, {A})). We can use the same argument
[Figure: a sequence of systems over {A, B, C}, starting from ABC : 1, AB : 0,
AC : 1, BC : 0, obtained by alternately applying R3 and R2. The lattices are not
reproduced.]

Fig. 3. “Random” application of the rules can lead to infinite loops

[Figure: the same starting system (ABC : 1, AB : 0, AC : 1, BC : 0). R2 is first
applied top-down, lowering A, B, and C to 0; R3 is then applied bottom-up,
lowering AC and finally ABC to 0, so that all values are 0. The lattices are not
reproduced.]

Fig. 4. Systematic application of the rules avoids infinite computations



for B and C. Then we apply R3 to adjust the value of AC. After this step,
{∅ : 0, A : $p_A$, C : $p_C$, AC : $p_{AC}$} satisfies R3. This system also
satisfies R2, because otherwise we could improve on A or on C, and we just showed
that we cannot further improve on A or C. Thus, the system proj(S, {A, C}) is
closed, and thus we cannot further improve on AC. This way, we iteratively go up,
and finally we can conclude that S must be complete after the full bottom-up step.
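The two meta-steps can be sketched in code. The version below is not the algorithm of Fig. 5 verbatim: instead of enumerating exactly the minimal multi-covers, it tries every k-cover made of at most `max_bag_size` sets, which is sound because R3 holds for all bags, but complete only if the bound is large enough (an assumption of this sketch). The function names, the dictionary representation, and the starting values 1/2 for A, B, and C are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations, combinations_with_replacement

def subsets(items, min_size=0):
    """All subsets of `items` with at least `min_size` elements."""
    items = sorted(items)
    return [frozenset(c) for r in range(min_size, len(items) + 1)
            for c in combinations(items, r)]

def top_down(p, items):
    """R2 fixpoint: p_K becomes the minimum of p_L over all L containing K."""
    for size in range(len(items), 0, -1):
        for K in (s for s in subsets(items) if len(s) == size):
            p[K] = min(p[L] for L in p if K <= L)

def bottom_up(p, items, max_bag_size=3):
    """R3 step, restricted to k-covers with at most `max_bag_size` sets."""
    for size in range(1, len(items) + 1):
        for K in (s for s in subsets(items) if len(s) == size):
            parts = subsets(K, min_size=1)
            for n in range(1, max_bag_size + 1):
                for bag in combinations_with_replacement(parts, n):
                    degrees = {sum(1 for m in bag if a in m) for a in K}
                    if len(degrees) == 1:  # every item covered k times
                        k = degrees.pop()
                        p[K] = min(p[K], sum(p[m] for m in bag) / k)

def close(p, items):
    """Apply R1, then R2 top-down, then R3 bottom-up."""
    p[frozenset()] = Fraction(0)
    top_down(p, items)
    bottom_up(p, items)
    return p

start = {frozenset(k): Fraction(v) for k, v in {
    "": "0", "A": "1/2", "B": "1/2", "C": "1/2",
    "AB": "0", "AC": "1", "BC": "0", "ABC": "1"}.items()}
result = close(dict(start), "ABC")
print(all(v == 0 for v in result.values()))  # True
```

On this input the top-down step lowers A, B, and C to 0, and the bottom-up step then lowers AC and ABC to 0, in line with the behaviour described for Fig. 4.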



6 Summary and Further Work 

We presented an axiomatization for complete systems of frequent sets. As an
intermediate stage in the proofs, we introduced the notion of a system of rare
sets. The axiomatization for rare sets contained three rules R1, R2, and R3.
From these rules we could easily derive the axiomatization, F1, F2, and F3 for
frequent sets. Because rule R3 yields a condition that needs to be checked for an
infinite number of bags, we replaced R3 by R3’. We showed that the completion
can be computed by applying R1, R2, and R3’ as inference rules. If these rules
are applied first top-down, and then bottom-up, the completion is reached within

Close(S)
    $p_\emptyset = 0$
    TopDown(S)
    BottomUp(S)

TopDown(S)
    for i = n downto 1 do
        for all itemsets K of cardinality i do
            make $p_K = \min_{K \subseteq L}(p_L)$

BottomUp(S)
    for i = 1 to n do
        for all itemsets K of cardinality i do
            make $p_K = \min_{\mathcal{K} \text{ minimal } k\text{-cover of } K} \frac{\sum_{K' \in \mathcal{K}} p_{K'}}{k}$

Fig. 5. Algorithm Close for finding the completion of the system
$S = \bigcup_{K \subseteq I} \{K : p_K\}$ over $I = \{I_1, \ldots, I_n\}$

a finite number of steps. In the future we want to study an axiomatization for 
systems in which not for every set a frequency is given. For some preliminary 
results on these sparse systems, we refer to [3]. Another interesting topic is 
expanding the axiomatization to include association rules and confidences. 
Abstract. In information-integration systems, source relations often
have limitations on access patterns to their data; i.e., when one must 
provide values for certain attributes of a relation in order to retrieve its 
tuples. In this paper we consider the following fundamental problem: can 
we compute the complete answer to a query by accessing the relations 
with legal patterns? The complete answer to a query is the answer that 
we could compute if we could retrieve all the tuples from the relations. 
We give algorithms for solving the problem for various classes of queries, 
including conjunctive queries, unions of conjunctive queries, and con- 
junctive queries with arithmetic comparisons. We prove the problem is 
undecidable for datalog queries. If the complete answer to a query can- 
not be computed, we often need to compute its maximal answer. The 
second problem we study is, given two conjunctive queries on relations 
with limited access patterns, how to test whether the maximal answer to 
the first query is contained in the maximal answer to the second one? We 
show this problem is decidable using the results of monadic programs. 



1 Introduction 

The goal of information-integration systems (e.g., [3,20,25]) is to support se- 
amless access to heterogeneous data sources. In these systems, a user poses a 
query on a mediator [26], which computes the answer by accessing the data at 
the underlying source relations. One of the challenges for these systems is to 
deal with the diverse capabilities of sources in answering queries. For instance, 
a source relation r(Star, Movie) might not allow us to retrieve all its data “for 
free.” Instead, the only way of retrieving its tuples is by providing a star name, 
and then retrieving the movies of this star. In general, relations in these systems 
may have limitations on access patterns to their data; i.e., one must provide 
values for certain attributes of a relation to retrieve its tuples. There are many 
reasons for these limitations, such as restrictive web search forms and concerns 
of security and performance. 

In this paper we first study the following fundamental problem: Given a query 
on relations with limited access patterns, can we compute the complete answer to 
the query by accessing the relations with legal patterns ? The complete answer to 
the query is the answer that we could compute if we could retrieve all the tuples
from the relations. Often users make decisions based on whether the answers 
to certain queries are complete or not. Thus the solution to this problem is 
important for the decision support and analysis by users. The following example 
shows that in some cases we can compute the complete answer to a query, even 
though we cannot retrieve all the tuples from relations. 

Example 1. Suppose we have two relations r{Star, Movie) and s{Movie, Award) 
that store information about movies and their stars, and information about 
movies and the awards they won, respectively. The access limitation of relation 
r is that each query to this relation must specify a star name. Similarly, the access 
limitation of s is that each query to s must specify a movie name. Consider the 
following query that asks for the awards of the movies in which Fonda starred: 

Q1 : ans(A) :- r(fonda, M), s(M, A)

To answer Q1, we first access relation r to retrieve the movies in which Fonda
starred. For each returned movie, we access relation s to obtain its awards.
Finally we return all these awards as the answer to the query. Although we
did not retrieve all the tuples in the two relations, we can still claim that the
computed answer is the complete answer to Q1. The reason is that all the tuples
of relation r that satisfy the first subgoal were retrieved in the first step. In 
addition, all the tuples of s that satisfy the second subgoal and join with the 
results of the first step were retrieved in the second step. 
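
The two-step plan described above can be sketched in Python. This is only an illustration: the source contents and the helper names `access_r`, `access_s`, `answer_q1`, and `complete_answer_q1` are hypothetical and do not come from the paper. Each access function models a source that answers only when its bound attribute is supplied.

```python
# Hypothetical contents of the two sources (not from the paper).
R = {("fonda", "m1"), ("fonda", "m2"), ("hanks", "m3")}                   # r(Star, Movie)
S = {("m1", "oscar"), ("m2", "bafta"), ("m3", "oscar"), ("m4", "emmy")}   # s(Movie, Award)

def access_r(star):
    """The only legal access to r: a star name must be supplied."""
    return {(s, m) for (s, m) in R if s == star}

def access_s(movie):
    """The only legal access to s: a movie name must be supplied."""
    return {(m, a) for (m, a) in S if m == movie}

def answer_q1():
    """Evaluate Q1 along the feasible order r(fonda, M), s(M, A)."""
    awards = set()
    for _, movie in access_r("fonda"):
        for _, award in access_s(movie):
            awards.add(award)
    return awards

def complete_answer_q1():
    """Reference answer, computed as if both relations were fully readable."""
    return {a for (s, m) in R if s == "fonda" for (m2, a) in S if m2 == m}
```

On this data the plan never sees the tuples for `hanks` or `m4`, yet `answer_q1()` and `complete_answer_q1()` agree.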

However, if the access limitation of the relation r is that each query to r must
specify a movie title (not a star name), then we cannot compute the complete
answer to query Q1. The reason is that there can be a movie in which Fonda
starred, but we cannot retrieve its tuple without knowing the movie.

In general, if the complete answer to a query can be computed for any database
of the relations in the query, we say that the query is stable. For instance,
the query Q1 in Example 1 is a stable query. As illustrated by the example, we
might think that we can test the stability of a query by checking the existence of 
a feasible order of all its subgoals, as in [9,28]. An order of subgoals is feasible if 
for each subgoal in the order, the variables bound by the previous subgoals provide
enough bound arguments that the relation for the subgoal can be accessed
using a legal pattern. However, the following example shows that a query can 
be stable even if such a feasible order does not exist. 

Example 2. We modify query Q1 slightly by adding a subgoal r(S, M), and have
the following query:

Q2 : ans(A) :- r(fonda, M), s(M, A), r(S, M)

This query does not have a feasible order of all its subgoals, since we cannot
bind the variable S in the added subgoal. However, this subgoal is actually
redundant, and we can show that Q2 is equivalent to query Q1. That is, there
is a containment mapping [4] from Q2 to Q1, and vice versa. Therefore, for any
database of the two relations, we can still compute the complete answer to Q2
by answering Q1.

Example 2 suggests that testing stability of a conjunctive query is not just 
checking the existence of a feasible order of all its subgoals. In this paper we 
study how to test stability of a variety of queries. The following are the results: 

1. We show that a conjunctive query is stable iff its minimal equivalent query
has a feasible order of all its subgoals. We propose two algorithms for testing
stability of conjunctive queries, and prove this problem is NP-complete (Section 3).

2. We study stability of finite unions of conjunctive queries, and give results
similar to those for conjunctive queries. We propose two algorithms for testing
stability of unions of conjunctive queries, and prove that stability of datalog
queries is undecidable (Section 3).

3. We propose an algorithm for testing stability of conjunctive queries with 
arithmetic comparisons (Section 4). 

4. We show that the complete answer to a nonstable conjunctive query can be 
computed for certain databases. We develop a decision tree (Figure 1) to 
guide the planning process to compute the complete answer to a conjunctive 
query (Section 5). 

In the cases where we cannot compute the complete answer to a query, we 
often want to compute its maximal answer. The second problem we study is, 
given two queries on relations with limited access patterns, how to test whether 
the maximal answer to the first query is contained in the maximal answer to the 
second one? Clearly the solution to this problem can be used to answer queries 
efficiently. 

Given a conjunctive query on relations with limited access patterns, [8,17]
show how to construct a recursive datalog program [24] to compute the maximal
answer to the query. That is, we can retrieve tuples from relations by
retrieving as many bindings from the relations and the query as possible, then
use the obtained tuples to answer the query. For instance, consider the relation
s(Movie, Award) in Example 1. Suppose we have another relation dm(Movie)
that provides movies made by Disney. We can use these movies to access relation 
s to retrieve tuples, and use these tuples to answer a query on these relations. 
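
The idea of collecting as many bindings as possible can be sketched as a fixpoint loop. This is a simplified illustration, not the datalog construction of [8,17]: it ignores attribute domains (any known value may be tried at any bound position), and the function name `retrievable_tuples` and the sample data are hypothetical.

```python
from itertools import product

def retrievable_tuples(db, patterns, initial_values=()):
    """Return, per relation, the tuples reachable through legal accesses only."""
    known = set(initial_values)                  # values usable as bindings
    obtained = {rel: set() for rel in db}
    changed = True
    while changed:
        changed = False
        for rel, tuples in db.items():
            for pattern in patterns[rel]:
                bound_pos = [i for i, ad in enumerate(pattern) if ad == "b"]
                # Try every combination of known values for the bound positions.
                for binding in product(list(known), repeat=len(bound_pos)):
                    for t in tuples:
                        if t in obtained[rel]:
                            continue
                        if all(t[i] == v for i, v in zip(bound_pos, binding)):
                            obtained[rel].add(t)
                            known.update(t)      # new values become bindings
                            changed = True
    return obtained

# Hypothetical data: s(Movie, Award) with pattern bf, dm(Movie) with pattern f.
DB = {"s": {("m1", "oscar"), ("m2", "bafta"), ("m9", "emmy")},
      "dm": {("m1",), ("m2",)}}
PATTERNS = {"s": ["bf"], "dm": ["f"]}
```

Here the movies returned by `dm` unlock two tuples of `s`, while the tuple for `m9` stays out of reach because no source ever supplies that binding.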

To test whether the maximal answer to a conjunctive query is contained 
in the maximal answer to another conjunctive query, we need to test whether 
the datalog program for the first one is contained in that for the second one. 
Since containment of datalog programs is undecidable [23], our problem of query
containment seems undecidable. However, in Section 6 we prove this containment 
problem is decidable using the results of monadic programs [6]. Our results 
extend the recent results by Millstein, Levy, and Friedman [19], since we loosen 
the assumption in that paper. We also discuss how to test the containment 
efficiently when the program for a query in the test is inherently not recursive.

2 Preliminaries

Limited access patterns of relations can be modeled using binding patterns [24].
A binding pattern of a relation specifies the attributes that must be given values
(“bound”) to access the relation. In each binding pattern, an attribute is adorned
as “b” (a value must be specified for this attribute) or “f” (the attribute can be
free). For example, a relation r(A,B,C) with the binding patterns {bff, ffb}
requires that every query to the relation must either supply a value for the first 
argument, or supply a value for the third argument. 
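
A binding pattern can be checked mechanically. The sketch below assumes a pattern is written as a string over "b" and "f", and that an access is described by the set of argument positions (0-based) for which values are supplied; the name `is_legal_access` is hypothetical.

```python
def is_legal_access(binding_patterns, bound_positions):
    """True if some pattern has every 'b' position among the supplied positions."""
    return any(
        all(adornment == "f" or i in bound_positions
            for i, adornment in enumerate(pattern))
        for pattern in binding_patterns
    )

R_PATTERNS = ["bff", "ffb"]   # the relation r(A, B, C) of the example above
```

For `R_PATTERNS`, supplying the first or the third argument is legal, while supplying only the second argument, or nothing at all, is not.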

Given a database D of relations with binding patterns and a query Q on
these relations, the complete answer to Q, denoted by ANS(Q, D), is the query’s
answer that could be computed if we could retrieve all tuples from the relations. 
However, we may not be able to retrieve all these tuples due to the binding 
patterns. The following observation serves as a starting point of our work. 

If a relation does not have an all-free binding pattern, then after some 
finite source queries are sent to the relation, there can always be some 
tuples in the relation that have not been retrieved, because we did not 
obtain the necessary bindings. 



Definition 1. (stable query) A query on relations with binding patterns is stable 
if for any database of the relations, we can compute the complete answer to the 
query by accessing the relations with legal patterns. 

We assume that if a relation requires a value to be given for a particular 
argument, the domain of the argument is infinite, or we do not know all the 
possible values for this argument. For example, the relation r(Star, Movie) in 
Example 1 requires a star name, and we assume that we do not know all the 
possible star names. As a result, we do not allow the “strategy” of trying all 
the (infinite) possible strings as the argument to test the relation, since this 
approach does not terminate. Instead, we assume that each binding we use to 
access a relation is either from a query, or from the tuples retrieved by another 
access to a relation, while the value is from the appropriate domain. 

Now the fundamental problem we study can be stated formally as follows: how
to test the stability of a query on relations with binding patterns? As we will see
in Section 3, in order to prove a query is stable, we need to show a legal plan that
can compute the complete answer to the query for any database of the relations.
On the other hand, in order to prove that a query Q is not stable, we need to
give two databases D1 and D2, such that they have the same observable tuples.
That is, by using only the bindings from the query and the relations, for both
databases we can retrieve the same tuples from the relations. However, the two
databases yield different answers to query Q, i.e., ANS(Q, D1) ≠ ANS(Q, D2).
Therefore, based on the retrievable tuples from the relations, we cannot tell 
whether the answer computed using these tuples is the complete answer or not. 
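
As an illustration of such a pair of databases, consider the variant of Example 1 in which r(Star, Movie) must be accessed with a movie title (pattern fb) and s(Movie, Award) has pattern bf. The Python sketch below uses hypothetical data and helper names, and for simplicity lets any known value be tried as a binding regardless of its domain.

```python
def observable(db):
    """Tuples obtainable from r (pattern fb) and s (pattern bf), starting
    from the only constant in the query, 'fonda'."""
    known, seen = {"fonda"}, set()
    while True:
        new = {("r", t) for t in db["r"] if t[1] in known}
        new |= {("s", t) for t in db["s"] if t[0] in known}
        if new <= seen:
            return seen
        seen |= new
        known |= {v for _, t in new for v in t}

def complete_answer(db):
    """ANS(Q, D) for ans(A) :- r(fonda, M), s(M, A), as if all tuples were readable."""
    return {a for (star, m) in db["r"] if star == "fonda"
              for (m2, a) in db["s"] if m2 == m}

# Two hypothetical databases that differ only in one tuple of r.
D1 = {"r": set(), "s": {("m1", "oscar")}}
D2 = {"r": {("fonda", "m1")}, "s": {("m1", "oscar")}}
```

Both databases expose the same (empty) set of observable tuples, but their complete answers differ, so no plan can tell them apart.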

3 Stability of Conjunctive Queries, Unions of Conjunctive
Queries, and Datalog Queries

In this section we develop two algorithms for testing stability of a conjunctive
query, and prove this problem is NP-complete. We also propose two algorithms
for testing stability of a finite union of conjunctive queries. We prove that
stability of datalog queries is undecidable.



3.1 Stability of Conjunctive Queries 

A conjunctive query (CQ for short) is denoted by: 

h(X) :- g1(X1), ..., gn(Xn)

In each subgoal gi(Xi), predicate gi is a relation, and every argument in Xi is
either a variable or a constant. The variables X in the head are called distinguished
variables. We use names beginning with lower-case letters for constants
and relation names, and names beginning with upper-case letters for variables. 

Definition 2. (feasible order of subgoals) Some subgoals g1(X1), ..., gk(Xk) in
a CQ form a feasible order if each subgoal gi(Xi) in the order is executable; that
is, there is a binding pattern p of the relation gi, such that for each argument A
in gi(Xi) that is adorned as b in p, either A is a constant, or A appears in a
previous subgoal. A CQ is feasible if it has a feasible order of all its subgoals.

The query Q1 in Example 1 is feasible, since (r(fonda, M), s(M, A)) is a
feasible order of all its subgoals. The query Q2 is not feasible, since it does
not have such a feasible order. A subgoal in a CQ is answerable if it is in a 
feasible order of some subgoals in the query. The answerable subgoals of a CQ 
can be computed by a greedy algorithm, called the Inflationary algorithm. That 
is, initialize a set Φa of subgoals to be empty. With the variables bound by the
subgoals in Φa, whenever a subgoal becomes executable by accessing its relation,
add this subgoal to Φa. Repeat this process until no more subgoals can be added
to Φa, and Φa will include all the answerable subgoals of the query. Clearly a
query is feasible if and only if all its subgoals are answerable. The following 
lemma shows that feasibility of a CQ is a sufficient condition for its stability. 
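
The Inflationary algorithm can be sketched in a few lines of Python. The encoding is an assumption made for illustration: a subgoal is a pair (relation name, tuple of arguments), variables are the names starting with an upper-case letter (the convention stated above), and `patterns` maps each relation to its list of binding patterns. The function names are hypothetical.

```python
def is_var(term):
    """Variables start with an upper-case letter; everything else is a constant."""
    return term[:1].isupper()

def answerable_subgoals(subgoals, patterns):
    """Greedily collect the answerable subgoals, in the order they become executable."""
    bound, answerable, pending = set(), [], list(subgoals)
    progress = True
    while progress:
        progress = False
        for rel, args in list(pending):
            # Executable: some pattern has all its 'b' arguments constant or bound.
            if any(all(ad == "f" or not is_var(a) or a in bound
                       for a, ad in zip(args, p))
                   for p in patterns[rel]):
                answerable.append((rel, args))
                pending.remove((rel, args))
                bound.update(a for a in args if is_var(a))
                progress = True
    return answerable

def is_feasible(subgoals, patterns):
    return len(answerable_subgoals(subgoals, patterns)) == len(subgoals)

PATTERNS = {"r": ["bf"], "s": ["bf"]}
Q1_BODY = [("r", ("fonda", "M")), ("s", ("M", "A"))]
Q2_BODY = Q1_BODY + [("r", ("S", "M"))]
```

With these inputs, the body of Q1 is feasible, while for Q2 the subgoal r(S, M) is never added because S is never bound.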

Lemma 1. A feasible CQ is stable. That is, if a CQ has a feasible order of 
all its subgoals, for any database of the relations, we can compute the complete 
answer to the query.¹



Corollary 1. A CQ is stable if it has an equivalent query that is feasible. 



¹ We do not provide all the proofs of the lemmas and theorems in this paper due to
space limitations. Refer to [15,16] for details.

The query Q1 in Example 1 is the minimal equivalent of the query Q2 in Example 2. A CQ is
minimal if it has no redundant subgoals, i.e., removing any of its subgoals will 
yield a nonequivalent query. It is known that each CQ has a unique minimal 
equivalent up to renaming of variables and reordering of subgoals, which can be 
obtained by deleting its redundant subgoals [4]. Now we give two theorems that 
suggest two algorithms for testing the stability of a CQ. 

Theorem 1. A CQ is stable iff its minimal equivalent is feasible. 

By Theorem 1, we give an algorithm CQstable for testing the stability of 
a CQ Q. The algorithm first computes the minimal equivalent Qm of Q by 
deleting the redundant subgoals in Q. Then it uses the Inflationary algorithm to 
test the feasibility of Qm. If Qm is feasible, then Q is stable; otherwise, Q is not
stable. The complexity of the algorithm CQstable is exponential, since we need
to minimize the CQ first, which is known to be NP-complete [4]. There is a
more efficient algorithm that is based on the following theorem. 

Theorem 2. Let Q be a CQ on relations with binding patterns and Qa be the 
query with the head of Q and the answerable subgoals of Q. Then Q is stable iff
Q and Qa are equivalent as queries, i.e., Q = Qa. 

Theorem 2 gives another algorithm CQstable* for testing the stability of a 
CQ Q as follows. The algorithm first computes all the answerable subgoals of Q. 
If all the subgoals are answerable, then Q is stable, and we do not need to test 
any query containment. Otherwise, the algorithm constructs the query Qa with 
these answerable subgoals and the head of Q. It tests whether Qa is contained
in Q (denoted Qa ⊑ Q) by checking if there is a containment mapping from
Q to Qa. If so, since Q ⊑ Qa is obvious, we have Q = Qa, and Q is stable;
otherwise, Q is not stable. The algorithm CQstable* has two advantages: (1)
If all the subgoals of Q are answerable, then we do not need to test whether
Qa ⊑ Q, thus its time complexity is polynomial in this case. (2) As we will see
in Section 4, this algorithm can be generalized to test stability of conjunctive 
queries with arithmetic comparisons. 
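
A compact Python sketch of CQstable* follows. It is an illustration under stated assumptions, not the authors' implementation: a CQ is a pair (head arguments, list of subgoals), variables start with an upper-case letter, and the containment mapping is found by brute-force search over all variable assignments, which is exponential in the number of variables. All function names are hypothetical.

```python
from itertools import product

def is_var(term):
    return term[:1].isupper()

def answerable_subgoals(body, patterns):
    """Inflationary algorithm: greedily collect the executable subgoals."""
    bound, answerable, pending = set(), [], list(body)
    progress = True
    while progress:
        progress = False
        for rel, args in list(pending):
            if any(all(ad == "f" or not is_var(a) or a in bound
                       for a, ad in zip(args, p)) for p in patterns[rel]):
                answerable.append((rel, args))
                pending.remove((rel, args))
                bound.update(a for a in args if is_var(a))
                progress = True
    return answerable

def has_containment_mapping(src, dst):
    """Brute-force search for a containment mapping from CQ src to CQ dst."""
    (src_head, src_body), (dst_head, dst_body) = src, dst
    variables = sorted({a for _, args in src_body for a in args if is_var(a)}
                       | {a for a in src_head if is_var(a)})
    targets = sorted({a for _, args in dst_body for a in args} | set(dst_head))
    dst_atoms = set(dst_body)
    for image in product(targets, repeat=len(variables)):
        h = dict(zip(variables, image))
        def apply(args):
            return tuple(h.get(a, a) for a in args)   # constants map to themselves
        if apply(src_head) == tuple(dst_head) and all(
                (rel, apply(args)) in dst_atoms for rel, args in src_body):
            return True
    return False

def cq_stable_star(head, body, patterns):
    answerable = answerable_subgoals(body, patterns)
    if len(answerable) == len(body):
        return True                    # feasible, hence stable (Lemma 1)
    # Qa is contained in Q iff there is a containment mapping from Q to Qa.
    return has_containment_mapping((head, body), (head, answerable))

HEAD = ("A",)
Q1_BODY = [("r", ("fonda", "M")), ("s", ("M", "A"))]
Q2_BODY = Q1_BODY + [("r", ("S", "M"))]
```

For Q2 of Example 2 the search finds the mapping that sends S to the constant fonda and leaves M and A unchanged, so the query is reported stable even though it is not feasible.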

Theorem 3. The problem of testing stability of a CQ is NP-complete.

Proof. (Sketch) Algorithm CQstable shows that the problem is in NP. It is known
that given a CQ Q and a CQ Q' that has a subset of the subgoals in Q, the
problem of deciding whether Q' ⊑ Q is NP-complete [4]. It can be shown that
this problem can be reduced to our stability problem in polynomial time [15].

3.2 Stability of Unions of Conjunctive Queries 

Let Q = Q1 ∪ ... ∪ Qn be a finite union of CQ’s (UCQ for short), and all its
CQ’s have a common head predicate. It is known that there is a unique minimal 
subset of Q that is its minimal equivalent [22]. 

Example 3. Suppose we have three relations r, s, and p, and each relation has
only one binding pattern bf. Consider the following three CQ’s:

Q1: ans(X) :- r(a, X)

Q2: ans(X) :- r(a, X), p(Y, Z)

Q3: ans(X) :- r(a, X), s(X, Y), p(Y, Z)

Clearly Q3 ⊑ Q2 ⊑ Q1. Queries Q1 and Q3 are both stable (since they are
both feasible), while query Q2 is not. Consider the following two UCQ’s: 𝒬1 =
Q1 ∪ Q2 ∪ Q3 and 𝒬2 = Q2 ∪ Q3. 𝒬1 has a minimal equivalent Q1, and 𝒬2 has a
minimal equivalent Q2. Therefore, query 𝒬1 is stable, and 𝒬2 is not.

In analogy with the results for CQ’s, we have the following two theorems: 

Theorem 4. Let Q be a UCQ on relations with binding patterns. Q is stable iff
each query in the minimal equivalent of Q is stable.

Theorem 5. Let Q be a UCQ on relations with binding patterns. Let Qs be the
union of all the stable queries in Q. Then Q is stable iff Q and Qs are equivalent
as queries, i.e., Q = Qs.

Theorem 4 gives an algorithm UCQstable for testing stability of a UCQ Q 
as follows. Compute the minimal equivalent Qm of Q, and test the stability of 
each CQ in Qm using the algorithm CQstable or CQstable*. If all the queries in 
Qm are stable, query Q is stable; otherwise, Q is not stable. Theorem 5 gives 
another algorithm UCQstable* for testing stability of a UCQ Q as follows. Test 
the stability of each query in Q by calling the algorithms CQstable or CQstable*. 
If all the queries are stable, then Q is stable. Otherwise, let Qs be the union of 
these stable queries. Test whether Q ⊑ Qs, i.e., Q is contained in Qs as queries.
If so, Q is stable; otherwise, Q is not stable. The advantage of this algorithm is
that we do not need to test whether Q ⊑ Qs if all the queries in Q are stable.
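
The control flow of UCQstable* can be sketched as follows. To keep the example short, CQ stability and CQ containment are passed in as callables rather than implemented; here they are stubbed with the facts stated in Example 3 (Q1 and Q3 stable, Q3 ⊑ Q2 ⊑ Q1, and no other containments, which is assumed). The test of Q ⊑ Qs uses the known characterization of containment for unions of CQ's: every CQ of Q must be contained in some CQ of Qs. All names are hypothetical.

```python
def ucq_stable_star(ucq, cq_is_stable, cq_contained_in):
    """UCQstable*: `ucq` is a list of CQ's; the callables are assumed to decide
    CQ stability and CQ containment (q1 contained in q2)."""
    stable = [q for q in ucq if cq_is_stable(q)]
    if len(stable) == len(ucq):
        return True                      # no containment test needed
    # Q is contained in Qs iff every CQ of Q is contained in some CQ of Qs.
    return all(any(cq_contained_in(q, s) for s in stable) for q in ucq)

# Facts of Example 3, encoded by hand (queries are referred to by name).
STABLE = {"Q1", "Q3"}
CONTAINED = {("Q3", "Q2"), ("Q2", "Q1"), ("Q3", "Q1")}

def is_stable(q):
    return q in STABLE

def contained_in(q1, q2):
    return q1 == q2 or (q1, q2) in CONTAINED
```

With these stubs, the union of Q1, Q2, and Q3 is reported stable because Q2 is absorbed by Q1, whereas the union of Q2 and Q3 is not, since Q2 is not contained in the only stable member Q3.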



3.3 Stability of Datalog Queries 

We want to know whether, given a datalog query on EDB predicates [24] with binding
patterns, we can compute the complete answer to the query by accessing the EDB
relations with legal patterns. Not surprisingly, this problem is not decidable.

Theorem 6. Stability of datalog queries is undecidable.

Proof. (Sketch) Let P1 and P2 be two arbitrary datalog queries. We show that
a decision procedure for the stability of datalog queries would allow us to decide
whether P1 ⊑ P2. Since containment of datalog queries is undecidable, we prove
the theorem.² Let all the EDB relations in the two queries have an all-free binding
pattern; i.e., there is no restriction on retrieving tuples from these relations.
Without loss of generality, we can assume that the goal predicates in P1 and
P2, named p1 and p2 respectively, have arity m. Let Q be the datalog query
consisting of all the rules in P1 and P2, and of the rules:

r1 : ans(X1, ..., Xm) :- p1(X1, ..., Xm), e(Z)

r2 : ans(X1, ..., Xm) :- p2(X1, ..., Xm)

where e is a new 1-ary relation with the binding pattern b. Variable Z is a new
variable that does not appear in P1 and P2. We can show that P1 ⊑ P2 if and
only if query Q is stable.

² The idea of the proof is borrowed from [7], Chapter 2.3.

In [15] we give a sufficient condition for stability of datalog queries. We show 
that if a set of rules on EDB relations with binding patterns has a feasible 
rule/goal graph [24] w.r.t. a query goal, then the query is stable. 

4 Stability of Conjunctive Queries with Arithmetic 
Comparisons 

In this section we develop an algorithm for testing the stability of a conjunctive 
query with arithmetic comparisons (CQAC for short). This problem is more 
challenging than the one for conjunctive queries, because equalities may help bind more
variables, and then make more subgoals answerable. In addition, a CQAC may 
not have a minimal equivalent formed from a subset of its own subgoals, as 
shown by Example 14.8 in [24]. Therefore, we cannot generalize the algorithm 
CQstable to solve this problem. 

Assume Q is a CQAC. Let O(Q) be the set of ordinary (uninterpreted) subgoals
of Q that do not have comparisons. Let C(Q) be the set of subgoals of Q that are
arithmetic comparisons. We consider the following arithmetic comparisons: <,
≤, =, >, ≥, and ≠. In addition, we make the following assumptions about the
comparisons: (1) Values for the variables in the comparisons are chosen from an 
infinite, totally ordered set, such as the rationals or reals. (2) The comparisons 
are not contradictory, i.e., there exists an instantiation of the variables such that 
all the comparisons are true. In addition, all the comparisons are safe, i.e., each 
variable in the comparisons appears in some ordinary subgoal. 

Definition 3. (answerable subquery of a CQAC) The answerable subquery of
a CQAC Q on relations with binding patterns, denoted by Qa, is the query including
the head of Q, the answerable subgoals Φa of Q, and all the comparisons
of the bound variables in Φa that can be derived from C(Q).

The answerable subquery Qa of a CQAC Q can be computed as follows.
We first compute all the answerable ordinary subgoals Φa of query Q using the
Inflationary algorithm. Note that if Q contains equalities such as X = Y, or
equalities that can be derived from inequalities (e.g., if we can derive X ≤ Y
and X ≥ Y, then X = Y), we need to substitute variable X by Y before
using the Inflationary algorithm to find all the answerable subgoals. Derive all
the inequalities among the variables in Φa from C(Q). Then Qa includes all the
constraints of the variables in Φa, because C(Q) may derive more constraints
that these variables should satisfy. For instance, assume variable X is bound in
Φa, and variable Y is not. If Q has comparisons X < Y and Y < 5, then variable
X in Qa still needs to satisfy the constraint X < 5.

We might be tempted to generalize the algorithm CQstable* as follows. Given
a CQAC Q, we compute its answerable subquery Qa. We test the stability of Q
by testing whether Qa ⊑ Q, which can be tested using the algorithm in [11,29]
(“the GZO algorithm” for short). However, the following example shows that 
this “algorithm” does not always work. 

Example 4. Consider query

P : ans(Y) :- p(X), r(X, Y), r(A, B), A < B, X ≤ A, A ≤ Y

where relation p has a binding pattern f, and relation r has a binding pattern
bf. Its answerable subquery is

Pa : ans(Y) :- p(X), r(X, Y), X ≤ Y

Using the GZO algorithm we know Pa ⋢ P. Therefore, we may claim that query
P is not stable. However, query P is actually stable. As we will see shortly, P is
equivalent to the union of the following two queries.

T1 : ans(Y) :- p(X), r(X, Y), X < Y

T2 : ans(Y) :- p(Y), r(Y, Y), r(Y, B), Y < B

Note both T1 and T2 are stable, since all their ordinary subgoals are answerable.

The above “algorithm” fails because the only case where Pa ⋢ P is when
X = Y. However, X ≤ A and A ≤ Y will then force A = X = Y, and the
subgoal r(A, B) becomes answerable! This example suggests that we need to
consider all the total orders of the query variables, similar to the idea in [12].³

Theorem 7. Let Q be a CQAC, and Ω(Q) be the set of all the total orders of
the variables in Q that satisfy the comparisons of Q. For each λ ∈ Ω(Q), let Q^λ
be the corresponding query that includes the ordinary subgoals of Q and all the
inequalities and equalities of this total order λ. Query Q is stable if and only if
for all λ ∈ Ω(Q), Q^λ_a ⊑ Q, where Q^λ_a is the answerable subquery of Q^λ.

The theorem suggests an algorithm CQACstable for testing the stability of a
CQAC Q as follows:

1. Compute all the total orders Ω(Q) of the variables in Q that satisfy the
comparisons in Q.

2. For each λ ∈ Ω(Q):

a) Compute the answerable subquery Q^λ_a of query Q^λ;

b) Test Q^λ_a ⊑ Q by calling the GZO algorithm;

c) If Q^λ_a ⋢ Q, claim that query Q is not stable and return.

3. Claim that query Q is stable.

³ Formally, a total order of the variables in the query is an order with some equalities,
i.e., all the variables are partitioned into sets S1, ..., Sk, such that each Si is a set
of equal variables, and for any two variables Xi ∈ Si and Xj ∈ Sj, if i < j, then
Xi < Xj.

For a CQAC Q on a database D, if by calling algorithm CQACstable we know
Q is stable, we can compute ANS(Q, D) by computing ANS(Q^λ_a, D) for each
total order λ in Ω(Q), and taking the union of these answers.

Example 5. The query P in Example 4 has the following 8 total orders:

λ1: X < A = Y < B;   λ2: X < A < Y < B;   λ3: X < A < Y = B;
λ4: X < A < B < Y;   λ5: X = A = Y < B;   λ6: X = A < Y < B;
λ7: X = A < Y = B;   λ8: X = A < B < Y.

For each total order λi, we write its corresponding query P^λi. For instance:

P^λ1 : ans(Y) :- p(X), r(X, Y), r(Y, B), X < Y, Y < B

We then compute the answerable subquery P^λi_a of each P^λi. All the 8 answerable
subqueries are contained in P. By Theorem 7, query P is stable. Actually, the union of all
the answerable subqueries except P^λ5_a is:

ans(Y) :- p(X), r(X, Y), r(A, B), X < Y, A < B, X ≤ A, A ≤ Y

whose answerable subquery is the query T1 in Example 4. In addition, P^λ5_a is
equivalent to the query T2. Since P is equivalent to the union of the 8 answerable
subqueries, we have proved P = T1 ∪ T2.
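
The first step of CQACstable, enumerating the total orders, can be sketched in Python. The sketch assumes that the comparisons of the query P are A < B, X ≤ A, and A ≤ Y (the reading under which the count of 8 total orders holds), represents a total order as a list of blocks of equal variables in increasing order, and uses hypothetical function names.

```python
def weak_orders(variables):
    """Yield every total order with equalities over `variables`, as a list of
    blocks (frozensets) of equal variables in increasing order."""
    if not variables:
        yield []
        return
    first, rest = variables[0], variables[1:]
    for order in weak_orders(rest):
        for i in range(len(order)):              # join an existing block
            yield order[:i] + [order[i] | {first}] + order[i + 1:]
        for i in range(len(order) + 1):          # or form a new block
            yield order[:i] + [frozenset({first})] + order[i:]

def satisfies(order, comparisons):
    """Check comparisons (u, op, v), with op in {'<', '<='}, against an order."""
    rank = {v: i for i, block in enumerate(order) for v in block}
    tests = {"<": lambda a, b: a < b, "<=": lambda a, b: a <= b}
    return all(tests[op](rank[u], rank[v]) for u, op, v in comparisons)

def total_orders(variables, comparisons):
    return [o for o in weak_orders(variables) if satisfies(o, comparisons)]

P_COMPARISONS = [("A", "<", "B"), ("X", "<=", "A"), ("A", "<=", "Y")]
ORDERS = total_orders(["X", "A", "Y", "B"], P_COMPARISONS)
```

Of the 75 possible orders with equalities over four variables, exactly 8 satisfy the comparisons, including the order X = A = Y < B in which the subgoal r(A, B) becomes answerable.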



5 Nonstable Conjunctive Queries with Computable 
Complete Answers 



So far we have considered whether the complete answer to a query can be computed
for any database. In this section we show that for some databases, even if a
CQ is not stable, we may still be able to compute its complete answer. However, 
the computability of its complete answer is data dependent, i.e., we do not know 
the computability until some plan is executed. The following is an example. 



Example 6. Consider the following queries on relations with binding patterns:

Relations     Binding patterns
r(A, B, C)    bff
s(C, D)       fb
p(D)          f

Queries
Q1 : ans(B) :- r(a, B, C), s(C, D)
Q2 : ans(D) :- r(a, B, C), s(C, D)



The two queries have the same subgoals but different heads, and both are 
not stable. However, we can still try to answer query Q1 as follows: send a query
r(a, X, Y) to relation r. Assume we obtain three tuples: (a, b1, c1), (a, b2, c2), and
(a, b2, c3). Thus, by retrieving these tuples, we know that the complete answer
is a subset of {(b1), (b2)}. Assume attributes A, B, C, and D have different
domains. If relation p provides the tuples that allow us to retrieve some tuples
(c1, d1) and (c2, d2) from s, we can know that {(b1), (b2)} is the complete answer.
On the other hand, if relation p does not provide tuples that allow us to compute
the answer {(b1), (b2)}, we do not know whether we have computed the complete
answer to Q1.

We can try to answer query Q2 in a similar way. After the first subgoal is
processed, we also obtain the same three tuples from r. However, no matter what
tuples are retrieved from relation s, we can never know the complete answer to
Q2, since there can always be a tuple (c1, d') in relation s that has not been
retrieved, and d' is in the answer to Q2. For both Q1 and Q2, if after processing
the first subgoal we obtain no tuples from r that satisfy this subgoal, then we
know that their complete answers are empty.

An important observation on these two queries is that Q1’s distinguished
variable B can be bound by the answerable subgoal r(a, B, C), while Q2’s distinguished
variable D cannot be bound. Based on this observation, we develop
a decision tree (Figure 1) that guides the planning process to compute the complete
answer to a CQ. The shaded nodes in the figure are where we can conclude
whether we can compute the complete answer.

Now we explain the decision tree in detail. Given a CQ Q and a database
D, we first minimize Q by deleting its redundant subgoals, and compute
its minimal equivalent Qm (arc 1 in Figure 1). Then we test the feasibility of
the query Qm by calling the Inflationary algorithm; that is, we test whether Qm
has a feasible order of all its subgoals. If so (arc 2 in Figure 1), Qm (thus Q)
is stable, and its answer can be computed following a feasible order of all the
subgoals in Qm.

If Qm is not feasible (arc 3), we compute all its answerable subgoals Φa and
check if all the distinguished variables are bound by the subgoals Φa. There are
two cases:

1. If all the distinguished variables are bound by the subgoals Φa (arc 4), then
the complete answer may be computed even if the supplementary relation [2,24]
(denoted Ia) of subgoals Φa is not empty. We compute the supplementary
relation Ia of these subgoals following a feasible order of Φa.

a) If Ia is empty (arc 5), then we know that the complete answer is empty.

b) If Ia is not empty (arc 6), let I^P_a be the projection of Ia onto the
distinguished variables. We use all the bindings from the query and the
relations to retrieve as many tuples as possible. (See [8,17] for the details.)
Let Φna denote all the nonanswerable subgoals.

i. If for every tuple t^P ∈ I^P_a, there is a tuple ta ∈ Ia, such that the
projection of ta onto the distinguished variables is t^P, and ta can
join with some tuples for all the subgoals Φna (tuple t^P is called
satisfiable), then we know that I^P_a is the complete answer to the
query (arc 7).

ii. Otherwise, the complete answer is not computable (arc 8).

2. If some distinguished variables are not bound by the subgoals Φa (arc 9), then
the complete answer is not computable, unless the supplementary relation
Ia is empty. Similarly to the case of arc 4, we compute Ia by following a
feasible order of Φa. If Ia is empty (arc 10), then the complete answer is
empty. Otherwise (arc 11), the complete answer is not computable.
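
The first case of the decision tree (arcs 4 to 8) can be sketched in Python for the query Q1 : ans(B) :- r(a, B, C), s(C, D) of Example 6. This is a simplified illustration with hypothetical data and a hypothetical function name `plan_q1`: the sources are modeled as in-memory sets, r is accessed with its first argument bound to a, and the only bindings available for s (pattern fb) are the D values provided by p.

```python
def plan_q1(r, s, p):
    """Follow arcs 4-8 for Q1: ans(B) :- r(a, B, C), s(C, D).
    Returns (answer tuples found, whether they are known to be complete)."""
    # Supplementary relation of the answerable subgoal r(a, B, C).
    i_a = {(b, c) for (x, b, c) in r if x == "a"}
    if not i_a:
        return set(), True                       # arc 5: the answer is empty
    # Bindings for D come from p (pattern f); each one is a legal access to s.
    d_values = {d for (d,) in p}
    s_seen = {(c, d) for (c, d) in s if d in d_values}
    c_seen = {c for c, _ in s_seen}
    projection = {b for b, _ in i_a}             # candidate answers for ans(B)
    satisfiable = {b for b, c in i_a if c in c_seen}
    # arc 7 if every candidate is satisfiable, otherwise arc 8.
    return satisfiable, satisfiable == projection

R = {("a", "b1", "c1"), ("a", "b2", "c2"), ("a", "b2", "c3")}
S = {("c1", "d1"), ("c2", "d2"), ("c3", "d9")}
```

When p supplies both d1 and d2, every candidate is satisfiable and the answer {b1, b2} is known to be complete; when p supplies only d1, the plan finds b1 but cannot decide whether b2 belongs to the answer.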

While traversing the decision tree from the root to a leaf node, we may reach a
node where we do not know whether the complete answer is computable until we
traverse one level down the tree. Two planning strategies can be adopted at this
kind of node: a pessimistic strategy and an optimistic strategy. A pessimistic
strategy gives up traversing the tree once the complete answer is unlikely to 
be computable. On the contrary, an optimistic strategy is optimistic about the 
possibility of computing the complete answer, and it traverses one more level by 
taking the corresponding operations. 




Fig. 1. Decision tree for computing the complete answer to a conjunctive query

6 Query Containment in the Presence of Binding
Patterns

In the cases where we cannot compute the complete answer to a query, we often want to
compute the maximal answer to the query. In this section we study the following 
problem: given two conjunctive queries on relations with limited access patterns, 
how to test whether the maximal answer to the first query is contained in the 
maximal answer to the second one? We show this problem is decidable, and 
discuss how to test the query containment efficiently. 

For a CQ Q on relations ℛ with binding patterns, let Π(Q, ℛ) denote the
program that computes the maximal answer to the query by using only the
bindings from query Q and relations ℛ. It is shown in [8,17] how the program
Π(Q, ℛ) is constructed. Π(Q, ℛ) can be a recursive datalog program, even if the
query Q itself is not recursive. That is, we might access the relations repeatedly
to retrieve tuples, and use the new bindings in these tuples to retrieve more
tuples from the relations. Formally, our containment problem is: Given two conjunctive
queries Q1 and Q2 on relations ℛ with limited access patterns, how to
test whether Π(Q1, ℛ) ⊑ Π(Q2, ℛ)? The following theorem shows that this problem
is decidable, even though containment of datalog programs is undecidable
[23].

Theorem 8. For two conjunctive queries Q1 and Q2 on relations ℛ with binding
patterns, whether Π(Q1, ℛ) ⊑ Π(Q2, ℛ) is decidable.

Proof. (Sketch) We can show that Π(Q1, ℛ) and Π(Q2, ℛ) are monadic datalog
programs. A datalog program is monadic if all its recursive predicates [24]
are monadic (i.e., with arity one); its nonrecursive IDB predicates can have arbitrary
arity. Cosmadakis et al. [6] show that containment of monadic programs
is decidable. Thus our containment problem is decidable.

This containment problem was recently studied in [19], which proves the same
decidability result using a different technique. There are some differences between
our approach to the decidability result and their approach. The decidability 
proof in that paper is based on the assumption that the set of initial bindings 
for the contained query is a subset of the initial bindings for the containing 
query. We loosen this assumption because the decidability result holds even if 
the two queries have different initial bindings. Another difference between the 
two approaches is that we assume that the contained query is a conjunctive 
query, while in [19] the contained query can be a recursive datalog query. Finally, 
[19] uses the source-centric approach to information integration [25], which is 
different from the query-centric approach [25] that is taken in our framework. 
However, we can easily extend our technique [16] to the source-centric approach. 

[6] involves a complex algorithm that uses tree-automata theory to test containment
of monadic programs. If one of the two programs in the test is bounded
(i.e., it is equivalent to a finite union of conjunctive queries), then the containment
can be tested more efficiently using the algorithms in [4,5,22]. Therefore, we are
interested in how to test the boundedness of the program Π(Q, ℛ) for a query
Q on relations ℛ with binding patterns.

It is shown in [6] that boundedness is also decidable for monadic datalog programs,
although it is not decidable in general [10]. However, testing boundedness
of monadic programs also involves a complex algorithm [6]. In [16] we study this 
problem for the class of connection queries [17]. Informally, a connection query 
is a conjunctive query on relations whose schemas are subsets of some global 
attributes, and some values are given for certain attributes in the query. In [16] 
we give a polynomial-time algorithm for testing the boundedness of the datalog 
program for a connection query. 

7 Related Work 

Several works consider binding patterns in the context of answering queries using 
views [8,14,1]. Rajaraman, Sagiv, and Ullman [21] propose algorithms for an- 
swering queries using views with binding patterns. In that paper all solutions 
to a query compute the complete answer to the query; thus only stable que- 
ries are handled. Duschka and Levy [8] solve the same problem by translating 
source restrictions into recursive datalog rules to obtain the maximally-contained 
rewriting of a query, but the rewriting does not necessarily compute the query’s 
complete answer. Li et al. [18] study the problem of generating an executable 
plan based on source restrictions. [9,28] study query optimization in the 
presence of binding patterns. Yerneni et al. [27] consider how to compute mediator 
restrictions given source restrictions. These four studies do not minimize a 
conjunctive query before checking its feasibility. Thus, they regard the query Q2 in 
Example 2 as an unsolvable query. In [17] we study how to compute the maximal 
answer to a conjunctive query with binding patterns by borrowing bindings from 
relations not in the query, but the computed answer may not be the complete 
answer. As we saw in Section 5, we can sometimes use the approach in that pa- 
per to compute the complete answer to a nonstable conjunctive query. Levy [13] 
considers the problem of obtaining complete answers from incomplete databases, 
but the author does not consider relations with binding restrictions. 

Acknowledgments. We thank Foto Afrati, Mayank Bawa, Rada Chirkova, and 
Jeff Ullman for their valuable comments on this material. 
Abstract. This paper presents a fully dynamic algorithm for maintaining 
the transitive closure of a directed graph. All updates and queries 
can be computed by constant depth threshold circuits of polynomial size 
(TC° circuits). This places transitive closure in the dynamic complexity 
class DynTC°, and implies that transitive closure can be maintained 
in databases using updates written in a first order query language plus 
counting operators, while keeping the size of the database polynomial in 
the size of the graph. 



1 Introduction 

Many restricted versions of transitive closure are known to be dynamically main- 
tainable using first-order updates. In this paper we show that the transitive clo- 
sure of a relation can be dynamically maintained using a polynomial-size data 
structure, with updates computable by constant depth threshold circuits. We 
show that updating the data structure upon adding or deleting an edge is in the 
circuit complexity class TC°, described in Sect. 4. This means there is a first- 
order uniform family of constant depth threshold circuits, with size polynomial 
in the size of the graph, computing the new values of all the bits in the data 
structure^. 

Queries computed by TC° circuits are exactly those queries defined by first- 
order logic plus counting quantifiers, a class which contains those SQL queries 
in which no new domain elements are created [6]. Thus, our results show that 
the transitive closure of a relation can be maintained by ordinary SQL queries 
which use only polynomially sized auxiliary relations. The contents of these 
auxiliary relations are uniquely determined by the input relation, making this a 
memoryless algorithm; the state of the data structure does not depend on the 
order of the updates to the input relation. 

This paper is organized as follows. In Sect. 3 dynamic complexity classes 
and the dynamic complexity of transitive closure are defined. In Sect. 4 circuit 
complexity classes including TC° are described. Sect. 5 presents the dynamic 
algorithm for transitive closure. Sects. 6 and 7 show that all operations in the 
dynamic algorithm are in the complexity class TC° and can be written as SQL 
update queries. Sects. 8 and 9 extend the algorithm to allow the addition of 
new elements to the relation's domain, and extend the algorithm to one dynamically 
maintaining the powers of an integer matrix. The final section offers some 
conclusions and directions for further work. 

^ All complexity classes are assumed to be FO-uniform unless otherwise stated. This 
is discussed in Sect. 4. 



2 Previous Related Work 

Patnaik and Immerman [9] introduced the complexity class DynFO (dynamic 
first-order) consisting of those dynamic problems such that each operation (in- 
sert, delete, change, or query) can be implemented as a first-order update to 
a relational data structure over a finite domain. This is similar to the defini- 
tion of a first-order incremental evaluation system (FOIES) by Dong and Su [4] . 
Patnaik and Immerman showed that problems including undirected reachability 
and minimum spanning trees, whose static versions are not first-order, are no- 
netheless in DynFO. Dong and Su showed that acyclic reachability is in DynFO 
[4]. The question of whether transitive closure is in DynFO remains open; this 
paper proves that transitive closure is in the larger dynamic complexity class 
DynTC° (dynamic TC°). 

Libkin and Wong have previously shown that the transitive closure of a rela- 
tion could be maintained using ordinary SQL updates while keeping a possibly 
exponential amount of auxiliary information [8]^. Our dynamic complexity class 
DynTC° is less powerful than the class SQLIES (SQL incremental evaluation 
systems), which contains this algorithm. SQLIES captures all queries maintainable 
using SQL updates, which allow the creation of large numbers of new domain 
elements. DynTC° lacks the capability to introduce new constants, allowing as 
base type only a single finite domain, usually considered as the ordered integers 
from 1 to n, including arithmetic operations on that domain. Since dynamic 
computations in DynTC° use a constant number of relations of constant arity, 
they use an amount of auxiliary data polynomially bounded by the size of this 
domain. General SQL computations, by introducing new tuples with large inte- 
ger constants as keys, can potentially square the size of the auxiliary databases 
at each iteration, leading to exponential or doubly exponential growth of the 
amount of auxiliary data kept. 

A lower bound by Dong, Libkin, and Wong [3] shows that the transitive 
closure of a relation is not dynamically maintainable using first-order updates 
without auxiliary data. Our new upper bound is not strict; we still do not know 
if first-order updates using auxiliary data are sufficient to maintain transitive 
closure. 



^ Their algorithm creates a domain element for each path in the directed graph induced 
by the relation. This could be restricted to the set of simple paths, or paths of length 
less than the number of vertices, but it is still exponential for most graphs. 
3 Problem Formulation 

A binary relation induces a directed graph, by considering the domain of the 
relation as the vertices of a graph and the pair (s, t) as a directed edge from s 
to t in that graph. A tuple (s, t) is in the transitive closure of a binary relation 
if and only if there is a path from s to t in the corresponding graph. In the 
remainder of this paper, we think of a binary relation and its transitive closure 
in this context, allowing us to use the conventional graph-theoretic language of 
vertices, edges, paths, and reachability. 

Computing the transitive closure of a relation is then equivalent to answering, 
for each pair of elements s and t, the decision problem REACH, which asks 
whether there is a path in the induced directed graph G from vertex s to vertex 
t. A dynamic algorithm maintaining the transitive closure of a relation consists 
of a set of auxiliary data structures and a set of update algorithms. These update 
the input relation, the auxiliary data structures, and the transitive closure of the 
input relation when a tuple is inserted into or deleted from the input relation. 
The complexity of this algorithm is the maximum complexity of any of these 
update algorithms. 
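
As a point of comparison for the dynamic algorithm, the static problem can be sketched as follows. This is a minimal illustration, not part of the paper's construction: the function names `reach` and `transitive_closure` are hypothetical, vertices are assumed to be numbered 0 to n − 1, and the empty path is counted, so every vertex reaches itself.

```python
from collections import deque


def reach(edges, n, s, t):
    """Decide REACH by breadth-first search: is there a directed path from s to t?"""
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def transitive_closure(edges, n):
    """Recompute the whole closure from scratch, one REACH query per pair."""
    return {(s, t) for s in range(n) for t in range(n) if reach(edges, n, s, t)}
```

Recomputing the closure this way after every insertion or deletion is exactly the cost a dynamic algorithm tries to avoid.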

We give a dynamic algorithm for the equivalent problem REACH. We define 
an auxiliary data structure, counting the number of paths between s and t of each 
length k less than a given bound, and give algorithms for updating and querying 
this data structure. The operations on this data structure are to add a directed 
edge between two vertices and to delete a directed edge between two vertices. 
We also specify an algorithm for querying the data structure, asking whether 
there is a directed path between two vertices. The update algorithms are given 
as circuits computing the new values for the data structure’s bits, or computing 
the result of a query. The complexity of our algorithm is the circuit complexity 
class these circuits fall into. We will show that we can create a polynomial-size 
data structure with updates and queries computable by TC° circuits. 

This places the problem REACH in the dynamic complexity class DynTC°. 
In [9], the complexity class Dyn-C is defined for any static complexity class C. 
A summary of this definition is that a query on an input structure is in Dyn-C 
if there is some set of auxiliary data, of size polynomial in the size of the input 
structure, so that we can update the input structure and the auxiliary data with 
update queries in the complexity class C upon changes to the input structure. 
This additional data must allow us to answer the original query on the current 
state of the input structure with a query in complexity class C as well. 

Remark 1. In specifying the operations for the dynamic version of REACH, we 
did not include operations to add or delete vertices. As this approach derives 
from finite model theory, we conceive of dynamic REACH as being a family of 
problems, parameterized by the number of graph vertices, n. We show in Sect. 8 
that this algorithm can be modified to yield a SQLIES which allows addition and 
deletion of vertices (new domain elements) while keeping the size of the auxiliary 
relations polynomial in the size of the input relation. This modified problem can 
no longer be categorized as being in DynTC°, however; it is no longer a dynamic 
problem of the type categorized by the dynamic complexity classes Dyn-C. 
4 The Parallel Complexity Class TC° 

In static complexity theory, many natural low-level complexity classes have been 
parallel complexity classes, containing those problems which can be solved by 
idealized massively parallel computers in polylogarithmic or constant time. One 
model for these computations is as circuits made up of Boolean gates taking 
the values 1 and 0. A circuit is an acyclic directed graph whose vertices are the 
gates of the circuit and whose edges are the connections between gates. There 
is an input gate for each bit of the input, and a single output gate. Some of 
the most important parallel complexity classes are those defined by circuits with 
polynomial size and constant or logarithmic depth. The relations between these 
complexity classes are currently known to include the following inclusions: 

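$$\mathrm{AC}^0 \subseteq \mathrm{ThC}^0 \subseteq \mathrm{NC}^1 \subseteq \mathrm{L} \subseteq \mathrm{NL} \subseteq \mathrm{AC}^1 \subseteq \mathrm{ThC}^1 \qquad (1)$$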

The classes NC¹, AC¹, and ThC¹ are classes of logarithmic depth, polynomial-size 
circuits. All these circuits contain AND, OR, and NOT gates, but the class 
NC¹ contains only AND and OR gates with 2 inputs, while the class AC¹ contains 
AND and OR gates with arbitrarily many inputs, bounded only by the total 
number of gates in the circuit. The class ThC¹ contains threshold gates as well 
as AND and OR gates. The classes AC° and ThC° are the corresponding classes 
of constant depth circuits. 

The circuit complexity class TC° contains all decision problems computed by 
a family of constant depth circuits containing AND, OR, and threshold gates, 
all with unbounded fan-in, as well as NOT gates. There is one circuit for each 
value of the input size, n, and there is a single constant and a single polynomial 
in n such that these circuits have a size (number of gates) bounded by that 
polynomial and depth bounded by that constant. The addition of threshold 
gates distinguishes these circuits from the circuit class AC°, which contains only 
AND, OR, and NOT gates. A threshold gate accepts if at least a certain 
number of its inputs are true; it may have up to polynomially many inputs, 
and its threshold may be any number. Thus the majority function, which is 1 
iff more than half of its inputs are 1, is computed by a threshold gate with n 
inputs and threshold ⌊n/2⌋ + 1. Conversely, by adding enough dummy inputs, set 
to the constants 0 or 1, a majority gate can simulate a threshold gate with any 
threshold. 
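
The padding argument can be made concrete with a small sketch. It assumes the convention that a gate with threshold t fires when at least t of its inputs are 1; the function names are hypothetical and the code models single gates only, not circuits.

```python
def threshold_gate(inputs, t):
    """Return 1 if at least t of the 0/1 inputs are 1, else 0."""
    return 1 if sum(inputs) >= t else 0


def majority_gate(inputs):
    """Return 1 iff more than half of the inputs are 1."""
    return threshold_gate(inputs, len(inputs) // 2 + 1)


def threshold_via_majority(inputs, t):
    """Simulate a threshold-t gate on m inputs with one majority gate.

    Pad with (m + 1 - t) constant ones and (t - 1) constant zeros.
    The padded gate has 2m inputs and fires when
    sum(inputs) + (m + 1 - t) >= m + 1, that is, when sum(inputs) >= t.
    """
    m = len(inputs)
    if not 1 <= t <= m:
        raise ValueError("threshold must lie between 1 and the number of inputs")
    padded = list(inputs) + [1] * (m + 1 - t) + [0] * (t - 1)
    return majority_gate(padded)
```

The padding at most doubles the fan-in, so the simulation keeps circuit size polynomial.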

The importance of the classes AC° and ThC° to this paper is that first-order 
queries can be decided by AC° circuits, and that ThC° is the smallest important 
complexity class containing AC°. It has been shown to strictly contain AC° be- 
cause important problems including parity, majority, and integer multiplication 
have been shown to be computable by ThC° circuits but not by AC° circuits. 

The classes L and NL denote deterministic and nondeterministic logspace 
computability by Turing machines. The class NL is relevant because the static 
version of REACH is a complete problem for this complexity class. This shows 
that the dynamic complexity of REACH is potentially significantly smaller than 
its static complexity. Finally, the class ThC¹ is important because the ThC° 
circuits we construct will be so complex that we will need to prove that they can 
be constructed by ThC¹ circuits, in order to eventually show that our algorithm 
can be implemented by SQL queries. 

The complexity of constructing a circuit is called the uniformity of that 
circuit, and circuits whose pattern of connections can be specified by a first-order 
formula will be called FO-uniform. By this, we mean that, if our gates 
are numbered from 0 to n − 1, there is a first-order formula φ(i, j) giving the edge 
relation, which states whether the output of gate i is connected to an input of 
gate j. This formula must quantify only over variables taking values from 0 to 
n − 1, and may use addition and multiplication operations and equality relations. 
Similar formulas must specify the type of each gate and which gates are inputs 
and the output from the circuit. 

Repeated integer multiplication, which we require in our algorithm, is the 
most notorious example of a problem for which ThC° circuits are known, but 
no FO-uniform ThC° circuits are known. For the remainder of this paper, we 
will assume that all circuits are FO-uniform unless we state otherwise explicitly. 
For example, we will say that a circuit is in ThC¹-uniform ThC°, meaning that the 
circuit's description can be computed by a (FO-uniform) ThC¹ circuit. 

The inclusions in (1) hold for the corresponding classes of FO-uniform circuits 
as well, and we have the additional property that FO-uniform AC° = FO, the 
class of decision problems definable with first-order formulas. If a family of TC° 
circuits is FO-uniform, the computation they perform can also be expressed by 
a first-order formula with the addition of majority quantifiers, (Mx), specifying 
that for at least half of the distinct values for the variable bound by the quantifier, 
the following subformula is true [1]. We will show in Sect. 6, first that we can 
construct TC¹-uniform TC° circuits to implement our updates, then that they 
can be replaced by FO-uniform TC° circuits. 

5 A Dynamic Algorithm for REACH 

We describe our algorithm by specifying the data structures it maintains, then 
describing the updates necessary to maintain those data structures in a consi- 
stent state. 

5.1 Data Structures 

Our input structure is a directed graph on n vertices, identified with the numbers 
0 to n − 1. It is represented by its adjacency matrix, an n by n array of bits $e_{i,j}$, 
where $e_{i,j}$ is 1 if there is a directed edge from i to j, 0 otherwise. The auxiliary 
information we will keep is the set of numbers $p_{i,j}(k)$, where $p_{i,j}(k)$ is the 
number of paths of length k from i to j in the graph. Note that $p_{i,i}(0) = 1$, 
$p_{i,j}(1) = e_{i,j}$, and that our paths are not necessarily simple paths; they may 
include cycles. Since the number of paths of length k from i to j is bounded by 
$n^k$, $p_{i,j}(k)$ is a number with at most k log n bits. We will only consider paths of 
length n − 1 or less. This is sufficient to decide our queries, since if two vertices 
are connected, they are connected by a path of length n − 1 or less. Therefore, 
for k < n, $p_{i,j}(k) < n^n$. We will pack the values $p_{i,j}(k)$ for a fixed i and j and 
all k in {0, . . . , n − 1} into an $n^3$-bit integer $a_{i,j}$: 

$$a_{i,j} = \sum_{k=0}^{n-1} p_{i,j}(k)\, r^k \qquad (2)$$

where $r = 2^{n^2}$ is large enough so that in the binary representation of $a_{i,j}$, 
the bits of $p_{i,j}(k)$ and $p_{i,j}(k+1)$ are well separated by zeroes. Note that there is 
no computation involved in creating the $a_{i,j}$ from the values $p_{i,j}(k)$, 0 ≤ k < n. 
We can regard the same array of bits as n numbers of $n^2$ bits or as one number 
with $n^3$ bits, depending on whether we want to reference the $p_{i,j}(k)$'s or $a_{i,j}$. 
We will also regard the integer $a_{i,j}$ as if it were a polynomial in r, in order to 
explain the proofs of the update computations more clearly. As the coefficients 
of this polynomial in r are small compared to r, there are no carries from one 
power of r to another, and thus the coefficients add and multiply as if we were 
multiplying polynomials. Since the coefficient of $r^k$ in $a_{s,t}$ counts the number of 
paths with length k, it can be seen as a generating function counting the paths 
from s to t according to their weight. Since this polynomial is truncated after 
the nth term, it is a truncated generating function. 
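
The packing can be illustrated with a short sketch that computes the path counts by brute force and packs them into one Python integer. The helper names are hypothetical, the choice of r as 2 to the power n squared is an assumption consistent with the bit counts quoted in Sect. 6, and paths of length 0 are counted only from a vertex to itself.

```python
def path_counts(edges, n):
    """Return p with p[k][i][j] = number of length-k paths from i to j, 0 <= k < n."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1
    # Length 0: exactly one (empty) path from each vertex to itself.
    p = [[[1 if i == j else 0 for j in range(n)] for i in range(n)]]
    for _ in range(1, n):
        prev = p[-1]
        p.append([[sum(prev[i][h] * adj[h][j] for h in range(n))
                   for j in range(n)] for i in range(n)])
    return p


def pack(p, n, i, j):
    """Pack p[0][i][j], ..., p[n-1][i][j] into one integer, base r = 2**(n*n)."""
    r = 1 << (n * n)
    return sum(p[k][i][j] * r ** k for k in range(n))


def unpack(a, n):
    """Recover the n coefficients by slicing the binary representation."""
    mask = (1 << (n * n)) - 1
    return [(a >> (n * n * k)) & mask for k in range(n)]
```

Because every count is smaller than r, `unpack(pack(...))` returns the original counts, which is the sense in which no computation is needed to switch between the two views.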

5.2 Adding an Edge 

We now show how to calculate the updated value for $a_{s,t}$ upon adding an edge 
(i, j). We will see that the new value for $a_{s,t}$, which we denote $a'_{s,t}$, can be 
expressed as a polynomial in the previous values of $a_{s,t}$, $a_{s,i}$, $a_{j,i}$, and $a_{j,t}$. 

Lemma 1. We can calculate the new values of all $a_{s,t}$ after insertion of an edge 
(i, j) into our graph as follows: 

$$a'_{s,t} = a_{s,t} + \sum_{k=0}^{n-2} a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t} \bmod r^n \qquad (3)$$

Proof. We show that the sum over k in the above expression counts the paths 
going through the added edge, (i, j), one or more times, by showing that the kth 
term counts the number of paths using that edge exactly k + 1 times. For all m < n, 
the coefficient of $r^m$ in $a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t}$ (regarded as a polynomial in r) is the 
number of paths of length m from s to t passing through edge (i, j) exactly 
k + 1 times. Once we have shown this, the expression 
$\sum_{k=0}^{n-2} a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t} \bmod r^n$ 
can be seen to count all paths of length less than n passing through (i, j) one or 
more times. These counts appear as the coefficients of $r^0, \ldots, r^{n-1}$, as in all 
our quantities $a_{s,t}$. This sum is thus the correct additive update to our counts 
$a_{s,t}$ upon adding the edge (i, j). 

Each term counts the paths going through (i, j) k + 1 times by the standard 
combinatorial technique of multiplying generating functions. If polynomial P 
counts the elements of set A, with the coefficient of $r^k$ counting the number of 
elements with weight k, and Q similarly counts the elements of set B, then the 
coefficient of $r^k$ in PQ counts the number of pairs of one element from A with 
one element from B, with total weight k. Therefore, the term $a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t}$, 
as a product of $a_{s,i}$, $r^{k+1}$, $a_{j,i}^k$, and $a_{j,t}$, counts the number of ways of choosing 
a path from s to i, k paths from j to i, and a path from j to t, weighted by 
their total weight plus an additional weight of k + 1. But these choices of a set of 
paths are in one-to-one correspondence with the unique decompositions of paths 
from s to t passing through edge (i, j) k + 1 times into their k + 2 subpaths 
separated by uses of the edge, as shown in Fig. 1. The total length of each 
path is the sum of the lengths of the subpaths, plus the additional k + 1 uses of 
the edge (i, j). This one-to-one correspondence is only with the paths made up 
of subpaths of length less than n, but all paths of length less than n of course 
only contain subpaths of smaller lengths. Thus the coefficient of $r^m$ in that term 
correctly counts the number of paths from s to t using edge (i, j) k + 1 times. 




Fig. 1. The decomposition of a path passing through (i, j) k + 1 times into k + 2 subpaths 



□ 
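
The insertion update of Lemma 1 can be checked numerically on small graphs. The following is a sequential sketch using Python's arbitrary-precision integers, not the circuit construction of Sect. 6; the function names are hypothetical and r is assumed to be 2 to the power n squared.

```python
def packed_counts(edges, n):
    """Compute every a[s][t] = sum over k < n of p_{s,t}(k) * r**k from scratch."""
    r = 1 << (n * n)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
    power = [[int(u == v) for v in range(n)] for u in range(n)]  # A**0
    a = [[0] * n for _ in range(n)]
    for k in range(n):
        for s in range(n):
            for t in range(n):
                a[s][t] += power[s][t] * r ** k
        power = [[sum(power[s][h] * adj[h][t] for h in range(n))
                  for t in range(n)] for s in range(n)]
    return a


def insert_edge(a, n, i, j):
    """Apply the update of Lemma 1 for a new edge (i, j) to all packed counts."""
    r = 1 << (n * n)
    mod = r ** n
    new = [[0] * n for _ in range(n)]
    for s in range(n):
        for t in range(n):
            # Term k counts the paths that use the new edge exactly k + 1 times.
            extra = sum(a[s][i] * r * (r * a[j][i]) ** k * a[j][t]
                        for k in range(n - 1))
            new[s][t] = (a[s][t] + extra) % mod
    return new
```

Comparing `insert_edge(packed_counts(E, n), n, i, j)` with `packed_counts(E + [(i, j)], n)` for an edge not already in E exercises the lemma directly.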



5.3 Deleting an Edge 

We now show a corresponding formula which will calculate the number of paths 
which are removed when we delete an edge from the graph. We are given the 
values $a_{s,t}$, which count the number of paths from s to t, including those paths 
using edge (i, j) one or more times. We need to calculate the number of paths 
from s to t not using edge (i, j). 

Lemma 2. We can maintain the values of all $a_{s,t}$ upon deletion of an edge (i, j) 
by the formula 

$$a'_{s,t} = a_{s,t} - \sum_{k=0}^{n-2} a_{s,i}\, r\, (-1)^k (r a_{j,i})^k\, a_{j,t} \bmod r^n \qquad (4)$$

The values $a_{s,t}$ on the right hand side include in their counts paths using the 
edge (i, j) zero or more times. 



Proof. We prove this by calculating the number of times a path from s to t which 
uses the edge (i, j) l times is counted by the term $a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t}$. It will turn 
out that a path using the edge 0 < l < n times will be counted $\binom{l}{k+1}$ times by 
the kth term. 

The proof uses the same one-to-one correspondence between items counted 
by the above product of truncated generating functions and decompositions of 
paths using edge (i,j) into subpaths of length less than n. However, since the 
subpaths from s to i counted by $a_{s,i}$ may include uses of the edge (i, j), we no 
longer have a unique decomposition into subpaths. 

A path which uses the edge (i, j) l times can be decomposed into k + 2 
subpaths separated by uses of the edge (i, j) in $\binom{l}{k+1}$ ways; just choose k + 1 
of the l uses. The remaining l − k − 1 uses of the edge (i, j) are among the 
internal edges of the subpaths. Since these decompositions are in one-to-one 
correspondence with the items counted by the product $a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t}$, each 
such path is counted $\binom{l}{k+1}$ times. 

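The number of times each such path is counted in the sum 

$$\sum_{k=0}^{n-2} a_{s,i}\, r\, (-1)^k (r a_{j,i})^k\, a_{j,t}$$

is 

$$\sum_{k=0}^{n-2} (-1)^k \binom{l}{k+1} = 1 - 1 + \sum_{k=1}^{n-1} -(-1)^k \binom{l}{k} = 1 - \sum_{k=0}^{n-1} (-1)^k \binom{l}{k}.$$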

Since the alternating binomial sum $\sum_{k=0}^{n-1} (-1)^k \binom{l}{k}$ is always 0 for 0 < l < n, we 
see that the sum shown above is equal to 1 for 0 < l < n. By inspection, we see 
that paths not passing through (i, j) at all contribute nothing to this sum. 
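
The counting identity used here is easy to confirm numerically for small n. The sketch below is only a sanity check of the arithmetic; the function name `times_counted` is hypothetical.

```python
from math import comb


def times_counted(l, n):
    """Net multiplicity, in the signed sum over k = 0..n-2, of a path using the edge l times."""
    return sum((-1) ** k * comb(l, k + 1) for k in range(n - 1))
```

For every n and every 0 < l < n the result is 1, and for l = 0 it is 0, matching the two cases distinguished in the proof.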

Since this implies that the coefficient of $r^m$ in $\sum_{k=0}^{n-2} a_{s,i}\, r\, (-1)^k (r a_{j,i})^k\, a_{j,t} \bmod r^n$ 
is exactly the number of paths of length m from s to t using edge (i, j) at 
least once, this is exactly the correct quantity to subtract from $a_{s,t}$ to get the 
new value $a'_{s,t}$ counting only the paths from s to t that do not use (i, j). □ 



Remark 2. The formulas for updating upon deletion and addition of an edge 
can be verified by seeing that these updates are functional inverses of each other. 
By composing the polynomial updates for an insertion and a deletion, we verify 
that $a_{s,t}$ remains unchanged mod $r^n$ when we insert and delete an edge. 

The final operation we need to be able to do is to query whether s and t are 
connected by a directed path in our graph. But there is a path from s to t if and 
only if the value $a_{s,t}$ is non-zero. This can easily be checked by an FO formula, 
and thus by a TC° circuit. 
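
The deletion update of Lemma 2 and the query can be sketched in the same sequential style. The function names are hypothetical, r is assumed to be 2 to the power n squared, and the code relies on Python's `%` returning a non-negative remainder.

```python
def packed_counts(edges, n):
    """Compute every a[s][t] = sum over k < n of p_{s,t}(k) * r**k from scratch."""
    r = 1 << (n * n)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
    power = [[int(u == v) for v in range(n)] for u in range(n)]  # A**0
    a = [[0] * n for _ in range(n)]
    for k in range(n):
        for s in range(n):
            for t in range(n):
                a[s][t] += power[s][t] * r ** k
        power = [[sum(power[s][h] * adj[h][t] for h in range(n))
                  for t in range(n)] for s in range(n)]
    return a


def delete_edge(a, n, i, j):
    """Apply the update of Lemma 2 for removing an existing edge (i, j)."""
    r = 1 << (n * n)
    mod = r ** n
    new = [[0] * n for _ in range(n)]
    for s in range(n):
        for t in range(n):
            # The alternating signs cancel the over-counting of paths
            # that use the deleted edge more than once.
            removed = sum(a[s][i] * r * (-1) ** k * (r * a[j][i]) ** k * a[j][t]
                          for k in range(n - 1))
            new[s][t] = (a[s][t] - removed) % mod
    return new


def connected(a, s, t):
    """Query: some path of length below n exists iff the packed count is non-zero."""
    return a[s][t] != 0
```

Deleting an edge and comparing against `packed_counts` on the reduced edge list gives a direct check of the lemma on small examples.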
6 Computing the Updates in TC° 

In this section, we first show that the updates can be computed in TC¹-uniform 
TC°. Then we show how our dynamic algorithm can be modified to use only 
FO-uniform TC° circuits. 

6.1 Computing the Updates Using a TC¹-Uniform TC° Circuit 

The formula for updating the $a_{s,t}$ upon inserting an edge is 

$$a'_{s,t} = a_{s,t} + \sum_{k=0}^{n-2} a_{s,i}\, r\, (r a_{j,i})^k\, a_{j,t} \bmod r^n. \qquad (5)$$

The computational power of constant depth threshold circuits was investigated 
by Reif and Tate in [10]. They found that polynomials with size and degree 
bounded by $n^{O(1)}$ and with coefficients and variables bounded by $2^{n^{O(1)}}$ could be 
computed by polynomial-size constant depth threshold circuits (Corollary 3.4 in 
that paper). We cannot use the result of Reif and Tate about the evaluation of 
polynomials directly because they only state that there exist polynomial-time-uniform 
TC° circuits to evaluate them. We show here that this can be done by 
TC¹-uniform TC° circuits. 

Evaluating this polynomial requires us to raise numbers of $O(n^3)$ bits to 
powers up to n − 2, multiply the results by other numbers, add n − 1 of the 
results together, and find the remainder mod $r^n$. Multiplying pairs of numbers 
and adding n numbers together can be done with TC° circuits [6]. If we have a 
number in binary, it is easy to find its remainder mod $r^n$ by dropping all but 
the low order $n^3$ bits. To raise the number $r a_{j,i}$ to the kth power, we will need 
the following lemma: 

Lemma 3. Finding the $n^{O(1)}$th power of a number with $n^{O(1)}$ bits is in TC¹-uniform 
TC°. 

Proof. A corollary to the result of Beame, Cook, and Hoover^ [2] shows that we 
can multiply n numbers with n bits each with a TC° circuit that is constructible 
in logspace from the product of the first polynomially many primes [7]. This 
product can be computed by a TC¹ circuit (a depth O(log n) binary tree of 
TC° circuits performing pairwise multiplications). The primes, all polynomially 
bounded in n, can be found by an FO formula. Since logspace is contained in TC¹, 
all of the operations required by our algorithm can be performed by a TC¹-uniform 
TC° circuit. Extension from the case of numbers with n bits raised to the nth power 
to numbers with $n^{O(1)}$ bits raised to the $n^{O(1)}$th power is trivial because the classes 
TC¹ and TC° are closed under polynomial expansion of the input. □ 

^ Beame, Cook, and Hoover show that n numbers can be multiplied by multiplying 
their remainders modulo all of the small primes with O(log n) bits. Then, knowing 
the remainder of the desired product modulo each of these small primes, the Chinese 
remainder theorem is used to find the desired product modulo the product of all the 
small primes. 
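
The arithmetic behind this approach can be sketched sequentially. The code below illustrates only the residue and Chinese remainder steps, not the constant depth circuit or its uniformity; the function names are hypothetical, and the result is correct only when the true product is smaller than the product of the chosen primes.

```python
from math import prod


def small_primes(count):
    """Return the first `count` primes, found by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def crt_product(numbers, primes):
    """Multiply `numbers` using only arithmetic modulo each small prime,
    then recombine the residues with the Chinese remainder theorem."""
    modulus = prod(primes)
    result = 0
    for p in primes:
        residue = 1
        for x in numbers:
            residue = residue * (x % p) % p
        cofactor = modulus // p
        # pow(cofactor, -1, p) is the modular inverse (Python 3.8 or later).
        result += residue * cofactor * pow(cofactor, -1, p)
    return result % modulus
```

For example, `crt_product([123, 456, 789], small_primes(10))` agrees with the ordinary product because that product is below the product of the first ten primes.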
6.2 Computing a TC¹-Uniform TC° Circuit by an FO-Uniform TC° 
Circuit in the Dynamic Setting 



Because a TC¹ circuit can be evaluated by log n rounds of a TC° circuit, we 
can compute this additional numerical data during the first log n rounds of our 
dynamic transitive closure algorithm. We simply add the circuit computing and 
storing this additional data to our existing TC° circuit for maintaining transitive 
closure. We need to add a third component to ensure that our algorithm computes 
transitive closure correctly during the first log n rounds, while the table 
of numerical data has not yet been computed. If we consider that log n rounds 
of updates can only add log n tuples to an initially empty relation, we see that 
this can be a first-order formula that finds the transitive closure of our log n 
size relation by brute force. We maintain a correspondence between those vertices 
with non-zero degree in our graph and the numbers between 1 and 2 log n. 
Then, to find out if two vertices are connected, we try (in parallel) all $n^2$ possible 
subsets of {1, . . . , 2 log n} to see if they correspond to the vertices forming 
a path between the two vertices. It is easy to verify that a set of vertices forms 
a path between two given vertices, possibly with additional disconnected cycles. 
We simply check that all vertices have degree 2 in the induced subgraph except 
for the two ends of the path. 

Thus we can add this first-order algorithm to our TC° circuit to create a circuit 
that correctly maintains the transitive closure of an initially empty relation, 
while computing the numerical data necessary for our full algorithm. The counts 
that our algorithm maintains of the number of paths between vertices have not 
yet been computed, however. To compute these, use the following log n rounds of 
computation to simulate 2 updates per round, catching up to the correct values 
of our counts after these log n rounds. Our first-order algorithm can easily be 
extended to work on graphs with up to 2 log n edges. 



Remark 3. This complex sequence of bootstrapping, computing some necessary 
numerical data during the first log n rounds of computation, threatens the memoryless 
nature of our algorithm. To ensure that all of our auxiliary data is 
completely determined by the state of the input relation, we must perform these 
rounds of bootstrapping under the control not of the clock, during the first log n 
rounds of computation, but under the control of the number of edges in our 
graph. When an update adds an edge to our graph, we compute another round 
in our emulation of the TC¹ circuit computing our numerical data, remembering 
the results of all previous rounds. When an update deletes an edge from 
our graph, we delete the results of the last round of numerical computation. Similarly, 
we emulate the second log n rounds, where 2 updates are performed on 
every round, during all updates where the graph has between log n and 2 log n 
edges, performing updates so that the numerically first 2i − 2 log n out of i edges 
are correctly represented in the counts. Thus, even the uniform TC° circuit 
maintaining the transitive closure of a relation can be made memoryless. 
Because we have shown that there are uniform TC° circuits maintaining the 
transitive closure of a relation, starting with an empty relation and no precomputed 
data, we have our main result: 

Theorem 1. Transitive Closure ∈ DynTC° 

7 DynTC° in SQL 

The fact that a DynTC° algorithm can be implemented as an incremental database 
algorithm with update queries expressed in SQL, using polynomial space, 
follows as a corollary from the fact that safe TC° queries can be written as SQL 
expressions. Theorem 5.27 in [6] states, in part, 

TC° = FO(COUNT), 

and Theorem 14.9 states that the queries computable in SQL are exactly the 
safe queries in Q(FO(COUNT)). 

Unfortunately, our TC° circuits do not correspond to safe queries. Our circuits 
include negation, which can only be implemented in general in SQL if we 
have a full relation to subtract from, leaving a tuple in the result if and only if 
it is not in the input. But asserting the existence of a unary relation containing 
the numbers from 1 to n is a negligible amount of precomputation to expect. If 
we extend our algorithm to allow a dynamically varying number of vertices, we 
can certainly add the number n to this base relation at the same time as we add 
the nth vertex to our set of vertices. 
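
The role of the base relation can be illustrated with a small sketch that runs SQL through Python's standard `sqlite3` module. The table names `dom` and `edge` and the function name are hypothetical; the point is only that the complement of a relation becomes an ordinary query once the full domain is available as a relation.

```python
import sqlite3


def complement_of_edges(n, edges):
    """Return all pairs over {1, ..., n} that are NOT in the edge relation."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dom(v INTEGER PRIMARY KEY)")  # the numbers 1..n
    con.execute("CREATE TABLE edge(s INTEGER, t INTEGER)")
    con.executemany("INSERT INTO dom VALUES (?)", [(v,) for v in range(1, n + 1)])
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    # Negation as subtraction from the full relation dom x dom.
    rows = con.execute(
        "SELECT a.v, b.v FROM dom AS a, dom AS b "
        "EXCEPT SELECT s, t FROM edge "
        "ORDER BY 1, 2"
    ).fetchall()
    con.close()
    return rows
```

Without the `dom` table there is nothing to subtract from, which is the sense in which the negation in the circuits needs a full relation.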

8 Adding and Deleting Vertices 

The fact that we have a polynomial bound on the amount of auxiliary data is
closely tied to our having a fixed bound n on the number of vertices of the
graph. If we do not have a fixed bound n, we can still use our algorithm as a
polynomial space SQLIES algorithm. We can run an instance of the algorithm
using some initial bound n until the number of vertices approaches within 2 log n
of this bound. Then we initialize a new instance of the algorithm with bound
2n, and are ready to start using it by the time we add the (n + 1)st vertex. After
computing the auxiliary data used by this larger instance, we can copy the counts
$p_{i,j}(k)$ encoded in the integers $a_{i,j}$ into the new larger copies of the integers $a_{i,j}$.
Since these are stored in binary, we just copy the counts to the right positions
in the new integers, leaving larger gaps. We must then also calculate the values
$p_{i,j}(k)$ for values of k between n and 2n. These are the counts of paths with
lengths between n and 2n. But as each path of length i, n < i < 2n - 1, has a
unique decomposition into a path of length n - 1 and a path of length i - n + 1,
we can find these values from the sum

$$\sum_{h=0}^{n-1} p_{i,h}(n-1)\, a_{h,j}.$$
The count of paths of length 2n - 1 can be found by repeating this process.
These are all TC^0 computations, so they can be done in a single step.
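
The step of copying counts "to the right positions in the new integers, leaving larger gaps" can be pictured with the following sketch. It assumes, for illustration only, that a sequence of counts is packed into one integer as digits in some base r and that the larger instance uses a larger base; the function names `pack`, `unpack`, and `widen` are hypothetical and the concrete bases are not those of the paper.

```python
def pack(counts, r):
    """Encode counts[k] (each smaller than r) as sum(counts[k] * r**k)."""
    assert all(0 <= c < r for c in counts)
    return sum(c * r**k for k, c in enumerate(counts))


def unpack(a, r, length):
    """Recover the first `length` base-r digits of a."""
    digits = []
    for _ in range(length):
        a, digit = divmod(a, r)
        digits.append(digit)
    return digits


def widen(a, r_old, r_new, length):
    """Re-encode the same counts with a larger base, i.e. with larger gaps."""
    return pack(unpack(a, r_old, length), r_new)


counts = [1, 0, 3, 2]
a_small = pack(counts, 2**4)
a_large = widen(a_small, 2**4, 2**8, len(counts))
```

When both bases are powers of two, each count simply moves to a new bit offset, which matches the remark that the counts are stored in binary.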

A final note is that the algorithm as stated does not allow us to delete a vertex 
with all of its associated edges in one step. However, an update polynomial even 
simpler than the update for deleting a single edge can be found, and is as follows: 

$$a'_{s,t} = \sum_{k=0}^{n-2} \cdots \bmod r^{n} \qquad (6)$$

9 Integer Matrix Powering 

The problem of directed reachability in a graph may be reduced to the problem 
of finding the entries of the nth power of the adjacency matrix of the graph. 
Our algorithm is based on this reduction, and the core algorithm can be modi- 
fied to maintain dynamically the first powers of an n by n integer matrix 

with entries of size Our algorithm can be seen upon inspection to handle 

correctly the addition of self-loops and multiple edges to our graph, so it cor- 
rectly handles adjacency matrices with non-zero diagonal and entries other than 
1. In our algorithm, we picked our constant r = 2" sufficient to separate the 
values which would be the entries of the kth power of the adjacency matrix. For
arbitrary matrices, we need to pick a larger r with polynomially many bits, suf- 
ficient to keep the entries separated. This dynamic algorithm will then compute 
arbitrary changes of a single matrix entry, and queries on the value of any entry 
of any of the computed powers of the matrix. 
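
The reduction from reachability to matrix powering can be sketched statically as follows. This is not the dynamic algorithm of the paper, only the underlying reduction; adding self-loops before powering is an assumption made here so that a single nth power witnesses paths of every shorter length, and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square integer matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][h] * b[h][j] for h in range(n)) for j in range(n)]
        for i in range(n)
    ]


def reachable(adj, s, t):
    """True iff t is reachable from s, read off the nth power of adj + I."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)
    ]
    power = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(n):
        power = mat_mul(power, with_loops)
    return power[s][t] > 0


# Edges 0 -> 1 -> 2; vertex 3 is isolated.
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
```

The entries of the power count walks, so they grow quickly; this is why the text speaks of integers with polynomially many bits.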

Note that the dynamic complexity of the multiplication of n different n by
n matrices is seen to be in DynTC^0 without this result, since only one entry of
one of the matrices can be changed at a time, causing a linear update of the
product. The dynamic complexity of the power of an integer matrix is greater
than or equal to that of iterated multiplication, and it is this complexity which
we have shown to be in DynTC^0.

10 Conclusions 

A major consequence of Theorem 1, Transitive Closure ∈ DynTC^0, is that
transitive closure can be maintained using SQL update queries while keeping the
size of the auxiliary relations polynomial in the size of the input relation. As
transitive closure has been used as the prime example of a database query not 
expressible in query languages without recursion, non-recursive algorithms for 
maintaining transitive closure are of significant interest. Our result reduces the 
space required by a non-recursive algorithm from exponential in the size of the 
input relation to polynomial in the size of the input relation. 

This new algorithm does not, however, lessen the importance of finding
efficient sequential algorithms for maintaining transitive closure in databases. The
dynamic algorithm given here is unlikely to be usefully implemented in practice,
because its work complexity is greater than that of the best sequential dynamic
algorithms. For example, the transitive closure of a symmetric relation (i.e.
undirected graph reachability) can be maintained by sequential algorithms with
polylogarithmic amortized time per operation [5].

Though this algorithm may not be practically useful in contexts where total
work is the crucial constraint, it is an important upper bound for the following
reason. Parallel complexity classes, using constant or logarithmic time, and
polynomial work, have been seen to be the natural complexity classes smaller than
P in the study of static complexity. They are robust under changes in encoding,
and have natural connections to descriptive complexity classes. If dynamic
complexity classes are similar to static complexity classes, then discovering what
dynamic problems are in DynFO, DynTC^0, and other similar classes may be
important. Since the dynamic complexity of problems is often less than the static
complexity, the lower complexity classes may be even more important than in
static complexity. This new upper bound for the dynamic complexity of directed
reachability helps us to understand the landscape of dynamic complexity better.
We now know that all special cases of reachability can be placed in dynamic
complexity classes below or equal to DynTC^0.

There are open questions related to this paper in many directions. It is, of 
course, still an open question whether REACH is in DynFO. Many other restric- 
ted subclasses of graph reachability have not yet been investigated. We suspect 
that directed grid graph reachability and plane graph (planar graphs with a fixed 
embedding) reachability are in DynFO. There may be problems that are com- 
plete for these dynamic complexity classes, and logics that express exactly the 
class of queries with dynamic algorithms in some dynamic complexity class. The 
largest unknown, however, remains whether dynamic complexity classes exist 
that are as robust as the familiar static complexity classes, and whether they 
are orthogonal to or comparable to these static complexity classes. 
Abstract. We define various extensions of first-order logic on linear as
well as polynomial constraint databases. First, we extend first-order logic
by a convex closure operator and show this logic, FO(conv), to be
closed and to have Ptime data-complexity. We also show that a weak
form of multiplication is definable in this language and prove the equivalence
between this language and the multiplication part of PFOL. We
then extend FO(conv) by fixed-point operators to get query languages
expressive enough to capture Ptime. In the last part of the paper we lift
the results to polynomial constraint databases.



1 Introduction 

In recent years new application areas have reached the limits of the standard
relational database model. In particular, geographical information systems, which
are of growing importance, exceed the power of the relational model with their
need to store geometrical figures, naturally viewed as infinite sets of points.
Therefore, new database models have been proposed to handle these needs. One such
data model is the framework of constraint databases introduced by Kanellakis,
Kuper, and Revesz in 1990 [KKR90]. Essentially, constraint databases are
relational databases capable of storing arbitrary elementary sets. These sets are
not stored tuple-wise but by an elementary formula defining them. We give a
precise definition of the constraint database model in Section 2. See [KLP00] for
a detailed study of constraint databases.

When applied to spatial databases, one usually considers either databases
storing semi-algebraic sets, called polynomial constraint databases, or semi-linear
sets, known as linear constraint databases. Polynomial constraint databases
allow the storage of spatial information in a natural way. The interest in the linear
model results from the rather high (practical) complexity of query evaluation
on polynomial databases. Essentially, the evaluation of first-order queries in the
polynomial model consists of quantifier-elimination in the theory of ordered real
fields, for which a non-deterministic exponential-time lower bound has been
proven. On the other hand, query evaluation on linear constraint databases can be
done efficiently. But first-order logic on these databases yields a query language
with rather poor expressive power.

It is known that first-order logic lacks the power to define queries relying on
recursion. A prominent example is connectivity, which is not expressible in FO on
almost all interesting classes of structures. This lack of expressive power shows
up on polynomial as well as linear constraint databases. In finite model theory,
a standard method to solve this problem is to consider least fixed-point logic, an
extension of FO by a least fixed-point operator (see [EF95]). Although this logic
has successfully been used in the context of dense-order constraint databases
(see [KKR90,GS97,GK99]), it is not suitable for the linear database model, as it
is neither closed nor decidable on this class of databases (see [KPSV96]). Logics
extending FO by fixed-point constructs can be found in [GK97] and [GK00].
Decidability and closure of these languages was achieved by including some kind
of stop condition for the fixed-point induction. In [Kre00], this problem has also
been attacked, resulting in a language extending FO by fixed-points over a finite
set of regions in the input database.

Besides queries based on recursion there are also other important queries that 
are not first-order definable. In the context of linear databases a very important 
example of a query not expressible in FO is the convex closure query. We address 
this problem in Section 3. Unfortunately, any extension of FO by operators 
capable of defining convex hulls for arbitrary sets leads to a non-closed language. 
Therefore, we can only hope to extend FO consistently by a restricted convex 
hull operator. In this paper we add an operator to compute convex hulls of 
finite sets only. We show that this can be done in a consistent way, resulting in a 
language that is closed and has Ptime data-complexity. It is also shown that this 
language and PFOL, an extension of FO by restricted multiplication defined by 
Vandeurzen, Gyssens, and Van Gucht [VGG98], have the same expressive power. 

As mentioned above, a language extending FO by recursion mechanisms has
been defined in [Kre00]. Although this language turns out to be rather expressive
for boolean queries, it lacks the power to define broad classes of non-boolean
queries. In Section 4 we present two alternative approaches to define fixed-point
logics on linear constraint databases. The first approach extends the logic defined
in [Kre00] by the convex closure operator mentioned above, whereas the second
approach combines convex closure with fixed-point induction over finite sets.
It is shown that both approaches lead to query languages capturing Ptime on
linear constraint databases.

In Section 5 we address the problem whether these results extend to the
class of polynomial constraint databases. Clearly, extending first-order logic by
a convex closure operator does not make sense, since convex closure is already
definable in FO+Poly, that is, first-order logic on polynomial constraint
databases. But it will be shown that the extension by fixed-point constructs can be
suitably adapted to the polynomial setting, resulting in a language strictly more
expressive than FO+Poly.
2 Preliminaries

Constraint databases. We first give a precise definition of the constraint
database model. See [KLP00] for a detailed introduction. The basic idea in the
definition of constraint databases is to allow infinite relations which have a
finite presentation by a quantifier-free formula. Let $\mathfrak{A}$ be a $\tau$-structure, called the
context structure, and $\varphi(x_1, \dots, x_n)$ be a quantifier-free formula of vocabulary
$\tau$. We say that an n-ary relation $R \subseteq A^n$ is represented by $\varphi(x_1, \dots, x_n)$ over
$\mathfrak{A}$ iff $R$ equals $\{\bar a \in A^n : \mathfrak{A} \models \varphi[\bar a]\}$. Let $\sigma := \{R_1, \dots, R_k\}$ be a relational
signature. A $\sigma$-constraint database over the context structure $\mathfrak{A}$ is a $\sigma$-expansion
$\mathfrak{B} := (\mathfrak{A}, R_1, \dots, R_k)$ of $\mathfrak{A}$ where all $R_i$ are finitely represented by formulae $\varphi_{R_i}$
over $\mathfrak{A}$. The set $\Phi := \{\varphi_{R_1}, \dots, \varphi_{R_k}\}$ is called a finite representation of $\mathfrak{B}$.
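
As a concrete illustration of a finitely represented infinite relation, the following sketch stores a semi-linear subset of the plane as a quantifier-free formula in disjunctive normal form over linear atoms and tests membership by evaluating that formula. The encoding of atoms as `(coefficients, operator, bound)` triples and the names `holds` and `member` are assumptions made for this example, not notation from the paper.

```python
from fractions import Fraction


def holds(atom, point):
    """Evaluate a linear atom c1*x1 + ... + cn*xn (op) b at a point."""
    coeffs, op, bound = atom
    value = sum(c * x for c, x in zip(coeffs, point))
    if op == "<":
        return value < bound
    if op == "<=":
        return value <= bound
    return value == bound  # op == "="


def member(dnf, point):
    """dnf is a list of conjunctions; each conjunction is a list of atoms."""
    return any(all(holds(atom, point) for atom in conj) for conj in dnf)


# The open unit square 0 < x < 1, 0 < y < 1 ...
open_square = [((-1, 0), "<", 0), ((1, 0), "<", 1),
               ((0, -1), "<", 0), ((0, 1), "<", 1)]
# ... together with the single point (2, 2).
single_point = [((1, 0), "=", 2), ((0, 1), "=", 2)]
relation = [open_square, single_point]

half = Fraction(1, 2)
```

The relation contains infinitely many points, yet the stored object is just the short formula `relation`.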

To measure the complexity of algorithms taking constraint databases as
inputs we have to define the size of a constraint database. Unlike finite
databases, the size of constraint databases cannot be given in terms of the number
of elements stored in them but has to be based on a representation of the
database. Note that equivalent representations of a database need not be of the same
size. Thus, the size of a constraint database depends on a particular
representation. In the following, whenever we speak of a constraint database $\mathfrak{B}$, we have
a particular representation $\Phi$ of $\mathfrak{B}$ in mind. The size $|\mathfrak{B}|$ of $\mathfrak{B}$ is then defined
as the sum of the lengths of the formulae in $\Phi$. This corresponds to the standard
encoding of constraint databases by the formulae of their representation.

Constraint queries. Fix a context structure $\mathfrak{A}$. A constraint query is a mapping
$Q$ from constraint databases over $\mathfrak{A}$ to finitely representable relations over $\mathfrak{A}$.
Note that queries are abstract, i.e., they depend only on the database, not on its
representation. That is, any algorithm that computes $Q$, taking a representation
of a database $\mathfrak{B}$ as input and producing a representation of $Q(\mathfrak{B})$ as output,
has to compute on two equivalent representations $\Phi$ and $\Phi'$ output formulae that
are not necessarily the same, but represent the same relation on $\mathfrak{A}$.

In the sequel we are particularly interested in queries defined by formulae
of a given logic $\mathcal{L}$. Let $\varphi \in \mathcal{L}$ be a formula with k free variables. Then
$\varphi$ defines the query $Q_\varphi$ mapping a constraint database $\mathfrak{B}$ over $\mathfrak{A}$ to the set

$$\varphi^{\mathfrak{B}} := \{(a_1, \dots, a_k) : \mathfrak{B} \models \varphi[\bar a]\}.$$

In order for $Q_\varphi$ to be well defined, this set
must be representable by a quantifier-free formula. If $\varphi$ is first-order, this means
that $\mathfrak{A}$ admits quantifier elimination. For more powerful logics than first-order
logic the additional operators must be eliminated as well. A logic $\mathcal{L}$ is closed for
a class $\mathcal{C}$ of constraint databases over $\mathfrak{A}$, if for every $\varphi \in \mathcal{L}$ and every $\mathfrak{B} \in \mathcal{C}$ the
set $\varphi^{\mathfrak{B}}$ can be defined by a quantifier-free first-order formula over $\mathfrak{A}$.

Typical questions that arise when dealing with constraint query languages are
the complexity of query evaluation for a certain constraint query language and
the definability of a query in a given language. For a fixed query formula $\varphi \in \mathcal{L}$,
the data-complexity of the query is defined as the amount of resources (e.g.
time, space, or number of processors) needed to evaluate the function that takes
a representation $\Phi$ of a database $\mathfrak{B}$ to a representation of the answer relation
3 First-Order Logic and Convex Hulls

In this section we define an extension of first-order logic on semi-linear databases 
such that with each definable finite set of points also its convex hull becomes 
definable. Defining convex hulls is a very important concept when dealing with 
semi-linear databases but first-order logic itself is not powerful enough to define 
it. Thus, adding an operator allowing the definition of convex hulls results in 
a language strictly more expressive than FO. Note that allowing the definition of
the convex closure of arbitrary point sets yields a non-closed query language, since
multiplication - and thus sets which are not semi-linear - becomes definable. We
therefore allow the definition of convex hulls for finite sets of points only.

Proviso. In the rest of this paper, whenever we speak about the interior of a 
set or a set being open, we always mean “interior” or “open” with respect to the 
set’s affine support. 



Definition 1 Let $\bar x_i, \bar x'_i, \bar y$ denote sequences of $l$ variables each and $\bar z$ denote a
sequence of variables, such that all variables are distinct. The logic FO(conv) is
defined as the extension of first-order logic by the following two rules:

(i) If $\varphi \in$ FO(conv) is a formula with free variables $\{\bar x_1, \dots, \bar x_k, \bar z\}$ then
$\psi := [\mathrm{conv}_{\bar x_1, \dots, \bar x_k}\, \varphi](\bar y, \bar z)$ is also a formula, with free variables $\{\bar y, \bar z\}$.

(ii) If $\varphi \in$ FO(conv) is a formula with free variables $\{\bar x_1, \bar x'_1, \dots, \bar x_k, \bar x'_k, \bar z\}$
then $\psi := [\mathrm{uconv}_{\bar x_1, \bar x'_1, \dots, \bar x_k, \bar x'_k}\, \varphi](\bar y, \bar z)$ is also a formula, with free variables
$\{\bar y, \bar z\}$.

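The semantics of the additional operators is defined as follows. Let $\mathfrak{B}$ be the
input database and $\varphi$ and $\psi$ be as in Part (i) of the definition above. Let $\varphi^{\mathfrak{B}}$
be the result of evaluating the formula $\varphi$ in $\mathfrak{B}$. If $\varphi^{\mathfrak{B}}$ is infinite, then $\psi^{\mathfrak{B}} := \emptyset$.
Otherwise,

$$\psi^{\mathfrak{B}} := \{(\bar a, \bar b) : \bar a \in \bigcup \{\mathrm{conv}\{\bar a_1, \dots, \bar a_k\} : \mathfrak{B} \models \varphi[\bar a_1, \dots, \bar a_k, \bar b]\}\},$$

where $\mathrm{conv}\{\bar a_1, \dots, \bar a_k\}$ denotes the interior (with respect to the affine support)
of the convex closure of $\{\bar a_1, \dots, \bar a_k\}$.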

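The semantics of the uconv operator is defined similarly. The motivation for
the uconv operator is that every set defined by the conv operator is bounded.
To overcome this restriction, the uconv operator is designed to handle "points at
infinity". Let $\varphi$ and $\psi$ be as indicated in Part (ii) of the definition above and let $\mathfrak{B}$
be the input database. Again, if $\varphi^{\mathfrak{B}}$ is infinite, then $\psi^{\mathfrak{B}} := \emptyset$. Otherwise,

$$\psi^{\mathfrak{B}} := \{(\bar a, \bar b) : \text{there are } \bar a_1, \bar a'_1, \dots, \bar a_k, \bar a'_k \text{ such that } \mathfrak{B} \models \varphi[\bar a_1, \bar a'_1, \dots, \bar a_k, \bar a'_k, \bar b] \text{ and } \bar a \in \mathrm{conv}(\textstyle\bigcup_{i=1}^{k} \mathrm{line}(\bar a_i, \bar a'_i))\},$$

where $\mathrm{line}(\bar a_i, \bar a'_i) := \{\bar x : (\exists b \in \mathbb{R}^{\geq 0}) \text{ such that } \bar x = \bar a_i + b(\bar a'_i - \bar a_i)\}$
defines the half line with origin $\bar a_i$ going through $\bar a'_i$.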

Intuitively, each pair $(\bar a_i, \bar a'_i)$ represents two points, the point $\bar a_i$ and the point
"reached" when starting at $\bar a_i$ and going in the direction of $\bar a'_i$ to infinite distance.
Now uconv returns the union of the open convex closure (with respect to its affine
support) of the 2k points represented by each tuple $((\bar a_1, \bar a'_1), \dots, (\bar a_k, \bar a'_k))$.

Note that the condition on the formula $\varphi$ to define a finite set is purely
semantical. Since finiteness of a semi-linear set is first-order definable - consider,
for example, the formula finite($\varphi$) stating that there are $\varepsilon, \delta > 0$ such that the
Manhattan-distance between any two points in $\varphi^{\mathfrak{B}}$ is greater than $\varepsilon$ and each
point in $\varphi^{\mathfrak{B}}$ is contained in a hypercube of edge length $\delta$ - this condition can also
be ensured on a syntactic level, thus giving the language an effective syntax.

We now give an example of a query definable in FO(conv) . 

Example 2 Let $\varphi(x)$ be a formula defining a finite set. The query

$$\mathrm{mult}_\varphi(x, y, z) := \{(a, b, c) : a \text{ satisfies } \varphi \text{ and } a \cdot b = c\}$$

is definable in FO(conv). We give a formula for the case $x, y, z > 0$. Let $\bar x_1 =
(x_{1,1}, x_{1,2})$ and $\bar x_2 = (x_{2,1}, x_{2,2})$ be pairs of variables and $\psi(\bar x_1, \bar x_2; x) := \bar x_1 =
(0,0) \wedge \bar x_2 = (1,x)$ be a formula defining for each $x$ the two points $(0,0)$ and
$(1,x)$ in $\mathbb{R}^2$. Then the formula

$$\chi(x, y, z) := [\mathrm{uconv}_{\bar x_1, \bar x_2}\, \varphi(x) \wedge \psi(\bar x_1, \bar x_2; x)](y, z; x)$$

defines the query $\mathrm{mult}_\varphi$. The uconv operator defines - for the parameter $x$ - the
half line with origin $(0,0)$ and slope $x$. Thus the point $(y,z)$ is on this line iff
$x \cdot y = z$.
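
The geometric idea of Example 2 can be checked numerically with a small sketch. It assumes exact rational or integer coordinates and treats the half line as open at its origin, in line with the case x, y, z > 0; the function names `on_open_half_line` and `mult` are hypothetical.

```python
from fractions import Fraction


def on_open_half_line(origin, through, point):
    """True iff point = origin + b * (through - origin) for some b > 0."""
    dx, dy = through[0] - origin[0], through[1] - origin[1]
    px, py = point[0] - origin[0], point[1] - origin[1]
    if dx * py - dy * px != 0:  # not collinear with the half line
        return False
    return dx * px + dy * py > 0  # same direction, and not the origin itself


def mult(finite_set, x, y, z):
    """x * y = z with x restricted to a finite set, via the half line of slope x."""
    return x in finite_set and on_open_half_line((0, 0), (1, x), (y, z))


factors = {2, 3, Fraction(1, 2)}
```

Only the first factor is restricted to the finite set; y and z range over all positive reals, which is the sense in which the multiplication is "weak".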

Theorem 3 FO(conv) is closed and has Ptime data-complexity.

The closure of the conv-operator follows from the finiteness of the sets, of 
which the convex closure is computed. Ptime data-complexity can easily be 
shown by induction on the structure of the queries. 

We now compare the language to other extensions of FO for linear databases.
In particular, we will show that FO(conv) and the language PFOL as defined by
Vandeurzen et al. [VGG98] have the same expressive power. We briefly recall
the definition of PFOL (see [VGG98] and [Van99] for details).

There are two different kinds of variables in PFOL, real variables and
so-called product variables. A PFOL program consists of a sequence of formulae
$(\varphi_1(x), \dots, \varphi_k(x); \varphi(\bar x))$. Each $\varphi_i$ is required to define a finite set $D_i \subseteq \mathbb{R}$. The
formulae are allowed to use multiplication between variables $x \cdot p$, but at least
$p$ must be a product variable. All product variables must be bound at some
point by a quantifier of the following form. In $\varphi_i$ the quantifiers for the product
variables are of the form $\exists p \in D_j$, where $j < i$. In $\varphi$ all $D_i$ may be used. Thus,
essentially, the formulae $\varphi_i$ define successively a sequence of finite sets and then
at least one factor of each multiplication is restricted to one of these sets. In the
original definition of PFOL there were also terms $t = \sqrt{|p|}$. We do not take this
term building rule into account and allow only multiplication. We believe this to
be the essential and important part of the language. But note that the square
root operator strictly increases the expressive power of the language. Therefore
we call the language considered here restricted PFOL.

It is known that convex closure can be defined in restricted PFOL. In Ex- 
ample 2 we already saw that multiplication with one factor bounded by a finite 
set can be defined in FO(conv). Thus, the proof of the following theorem is 
straightforward. 
Theorem 4 Restricted PFOL = FO(conv).



Note 5 As the previous theorem shows, we can express atoms of the form $x \cdot_\varphi y =
z$ in FO(conv), with the semantics being that $x \cdot y = z$ and $\varphi(x)$ holds, where $\varphi$
is required to define a finite set. From now on, we allow atoms of this form in
FO(conv)-formulae.

The previous theorem implies that the bounds on expressiveness proven for 
PFOL carry over to FO(conv). Here we mention one result which will be of 
special importance in Section 4.2. 

It has been shown in [Van99] that there is a PFOL query which returns on
a semi-linear set $S \subseteq \mathbb{R}^n$ a finite relation $S^{\mathrm{enc}} \subseteq \mathbb{R}^{n(n+1)}$ of $(n+1)$-tuples of
points in $\mathbb{R}^n$, such that

$$S = \bigcup_{(\bar a_1, \dots, \bar a_{n+1}) \in S^{\mathrm{enc}}} \mathrm{conv}(\bar a_1, \dots, \bar a_{n+1}).$$

Thus, PFOL
can compute a canonical finite representation of the input database and recover
the original input from it. This will be used below to define a fixed-point query
language capturing Ptime.
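
How a set is recovered from such a finite relation of point tuples can be sketched for the plane (n = 2). The sketch assumes that conv denotes the interior with respect to the affine support, as in Section 3, so a triple of points stands for an open triangle, an open segment, or a single point; the names `in_relative_interior` and `decode_member` and the sample encoding are illustrative only and do not reproduce the construction of [Van99].

```python
def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_relative_interior(p, triple):
    """Is p in the interior (w.r.t. affine support) of the hull of three points?"""
    a, b, c = triple
    if cross(a, b, c) != 0:
        # Proper triangle: p must be strictly on the same side of all edges.
        s1, s2, s3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
        return (s1 > 0 and s2 > 0 and s3 > 0) or (s1 < 0 and s2 < 0 and s3 < 0)
    pts = sorted({a, b, c})
    lo, hi = pts[0], pts[-1]
    if lo == hi:
        return p == lo  # all three points coincide
    # Collinear points: open segment between the two extreme points.
    return cross(lo, hi, p) == 0 and lo < p < hi


def decode_member(p, encoding):
    """Membership in the union of the hulls listed in the finite encoding."""
    return any(in_relative_interior(p, t) for t in encoding)


A, B, C = (0, 0), (4, 0), (0, 4)
# Closed triangle ABC = open triangle + three open edges + three vertices.
closed_triangle = [(A, B, C), (A, B, B), (B, C, C), (C, A, A),
                   (A, A, A), (B, B, B), (C, C, C)]
```

Because each hull is taken open, boundary pieces need their own degenerate tuples, as the seven entries of `closed_triangle` show.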

4 Query Languages Combining Convex Hulls and 
Recursion 

In this section we present two approaches to combine the query language defined 
above with fixed-point constructs. 

4.1 The Logic RegLFP(conv)

Let $S \subseteq \mathbb{R}^d$ be a semi-linear set. An arrangement of S is a partition of $\mathbb{R}^d$ into
finitely many disjoint regions, i.e., connected subsets of $\mathbb{R}^d$, such that for each
region R either $R \cap S = \emptyset$ or $R \subseteq S$. It is known that for a fixed dimension
arrangements of semi-linear sets can be computed in polynomial time. See e.g.
[Ede87] or [GO97] for details.

It follows from the definition that the input relation S can be written as a 
finite union of regions in its arrangement. In this section we consider a query 
language which has access to the set of regions in such an arrangement of the 
input database. This gives the logic access to the representation of the database 
increasing, thus, its expressive power. Precisely, we consider a fixed-point logic 
where the fixed-point induction is defined over the finite set of regions. The 
semantics of the logic is defined in terms of certain two-sorted structures, called 
region extensions of linear constraint databases. Let $\mathfrak{B} := ((\mathbb{R}, <, +), S)$ be a
database and let $A(S)$ be the set of regions in an arrangement of S. The logic
then has separate variables and quantifiers for the reals and the set of regions. 
We now give the precise definitions. 

Definition 6 Let $\mathfrak{B} := ((\mathbb{R}, <, +), S)$ be a linear constraint database, where S is
a d-ary relation, and let $A(S)$ be the set of regions of an arrangement of S. The
structure $\mathfrak{B}$ gives rise to a two-sorted structure $\mathfrak{B}^{\mathrm{Reg}} := ((\mathbb{R}, <, +), S; \mathrm{Reg}, \mathrm{adj})$,
called the region extension of $\mathfrak{B}$, with sorts $\mathbb{R}$ and $\mathrm{Reg} := A(S)$ and the adjacency
relation $\mathrm{adj} \subseteq \mathrm{Reg} \times \mathrm{Reg}$, where two regions are adjacent if there is a point p in
one of them such that every $\varepsilon$-neighbourhood of p has a non-empty intersection
with the other region.

The dimension of a region is defined as the dimension of its affine support, 
i.e., the dimension of the smallest affine subspace it is contained in. It is known 
that the number of regions in the arrangement is bounded polynomially in the
size of the representation of S. As arrangements can be computed in polynomial
time, one can compute the region extension of a given database in polynomial 
time as well. 

We now define the logic RegLFP(conv) which is FO(conv) extended by a 
least fixed-point operator on the set of regions in the region extensions. In the 
definition of the logic we deal with three types of variables, so-called element-, 
region-, and relation variables. Element variables will be interpreted by real 
numbers, region variables by regions in the region extension of the database. 
Each relation variable is equipped with a pair $(k, l) \in \mathbb{N}^2$ of arities. Relation
variables of arity $(k, l)$ are interpreted by subsets $M \subseteq \mathrm{Reg}^k \times \mathbb{R}^l$ such that for
each $\bar R \in \mathrm{Reg}^k$ the set $\{\bar x : (\bar R, \bar x) \in M\}$ is finitely representable.

Definition 7 The logic RegLFP(conv) is defined as the extension of FO(conv) on
region extensions by a least fixed-point operator. Precisely, the logic extends
FO(conv) by the following rules.

- If R is a region variable and $\bar x$ is a sequence of element variables, then $R\bar x$
is a formula.

- If $\varphi$ is a formula and R a region variable, then $\exists R\, \varphi$ is also a formula.

- If M is a relation variable of arity $(k, l)$, $\bar R := R_1, \dots, R_k$ is a sequence of
region variables, and $\bar x := x_1, \dots, x_l$ is a sequence of element variables, then
$M\bar R\bar x$ is a formula.

- If $\varphi(M, \bar R, \bar x)$ is a formula with free element variables $\bar x$, free region variables
$\bar R := R_1, \dots, R_k$, and a free $(k, l)$-ary relation variable M, such that M
occurs only positively in $\varphi$, then $[\mathrm{LFP}_{M, \bar R, \bar x}\, \varphi](\bar R, \bar x)$ is a formula.

- If $\varphi(x)$ is a formula and x, y, z are element variables, then $x \cdot_\varphi y = z$ is a
formula.

RegLFP(conv)-queries are defined by RegLFP(conv)-formulae without free region
or relation variables.

The semantics of the new rules is defined as follows. An atom $R\bar x$ states that
the point $\bar x$ is contained in the region R; an atom $M\bar R\bar x$ states that the tuple
$(\bar R, \bar x)$ is contained in M; a formula $\exists R\, \varphi$ states that there is a region $R \in \mathrm{Reg}$
satisfying $\varphi$. The semantics of conv, uconv, and $\cdot_\varphi$ is defined as in the previous
section.

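Let M be a $(k, l)$-ary relation variable, $\bar R := R_1, \dots, R_k$ be a sequence of
region variables, $\bar x := x_1, \dots, x_l$ be a sequence of element variables, and $\varphi(M, \bar R, \bar x)$
be a formula positive in M. The result of the formula $\psi := [\mathrm{LFP}_{M, \bar R, \bar x}\, \varphi]$
evaluated in a database $\mathfrak{B}$ is defined as the least fixed-point of the function $f_\varphi$
defined as

$$f_\varphi : \mathrm{Pow}(\mathrm{Reg}^k \times \mathbb{R}^l) \longrightarrow \mathrm{Pow}(\mathrm{Reg}^k \times \mathbb{R}^l)$$

$$M \longmapsto \{(\bar R, \bar a) \in \mathrm{Reg}^k \times \mathbb{R}^l : (\exists \bar y\, (\bar R, \bar y) \in M \wedge (\bar R, \bar a) \in M) \vee (\neg \exists \bar y\, (\bar R, \bar y) \in M \wedge \mathfrak{B}^{\mathrm{Reg}} \models \varphi(M, \bar R, \bar a))\}.$$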

Clearly, the least fixed-point of the function $f_\varphi$ exists and can be computed
inductively in time polynomial in the number of regions and thus also in the
size of the database.

Intuitively, we can think of the stages of the fixed-point induction as a set
of tuples of regions $\bar R$, where to each $\bar R$ there is a formula $\varphi_{\bar R}(\bar x)$ attached
defining a set of points in $\mathbb{R}^l$. But once a tuple of regions is contained in some
stage of the fixed-point induction, the formula attached to it cannot be changed
anymore. This is ensured by the first disjunct in the definition of $f_\varphi$. An example
motivating this definition of the fixed-point operator is given below.

Note that there are different decompositions of $\mathbb{R}^d$ satisfying the conditions
of an arrangement as presented above. Thus, the semantics of the logic depends 
on a particular decomposition chosen. But the results we prove below stay true 
for most decompositions, as long as the input can be recovered from them. Thus, 
for a given application area, one should choose a decomposition which is more 
intuitive to use. 

Having defined the logic, we now give a motivating example for it. 

Example 8 Suppose we are given a road map with cities and the highways con- 
necting them. Typically, the maps contain information about the distance between 
any two adjacent cities directly connected by a section of a highway. In this set- 
up a useful decomposition of the input into regions would be to have a region for 
each city, one for each section of a highway between two cities, as well as regions 
for the other parts of the map. 

Now suppose we want to travel from one city to another on a certain
highway and we want to know the distance between these two cities. Let the
constants s, t denote the regions of the source and target city and assume a formula
$\mathrm{dist}(C, C', d)$, stating that C and C' are regions of adjacent cities and d is the
distance between both on the chosen highway. Then the formula

$$\varphi(x) := [\mathrm{LFP}_{M, C_1, C_2, d}\ \mathrm{dist}(C_1, C_2, d) \vee (\exists C\, \exists d_1\, \exists d_2\ M(C_1, C, d_1) \wedge \mathrm{dist}(C, C_2, d_2) \wedge d = d_1 + d_2)](s, t, x)$$

defines the distance between the two cities. Here the real variable d in the
fixed-point induction is used to sum up the distances.
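
The stage-wise behaviour of this fixed-point induction can be simulated on a toy road map. The sketch below is an illustration under simplifying assumptions: regions are plain city names, the attached point sets are finite sets of numbers rather than arbitrary finitely representable sets, and the names `region_lfp`, `step`, `DIST`, and `CITIES` are hypothetical. It follows the rule that a tuple of regions keeps its attached set once it has entered a stage.

```python
def region_lfp(step, region_tuples):
    """Iterate M -> f(M): tuples already in M are frozen, new ones are added."""
    M = {}  # region tuple -> frozenset of attached values
    while True:
        added = {}
        for R in region_tuples:
            if R in M:
                continue  # first disjunct: the attached set cannot change
            values = frozenset(step(M, R))
            if values:
                added[R] = values
        if not added:
            return M  # no new region tuple: fixed point reached
        M.update(added)


# One directed highway A -> B -> C -> D with section lengths.
DIST = {("A", "B"): 5, ("B", "C"): 7, ("C", "D"): 3}
CITIES = ["A", "B", "C", "D"]


def step(M, pair):
    """dist(C1, C2, d) or exists C, d1, d2: M(C1, C, d1), dist(C, C2, d2), d = d1 + d2."""
    c1, c2 = pair
    out = set()
    if pair in DIST:
        out.add(DIST[pair])
    for c in CITIES:
        if (c1, c) in M and (c, c2) in DIST:
            out.update(d1 + DIST[(c, c2)] for d1 in M[(c1, c)])
    return out


pairs = [(a, b) for a in CITIES for b in CITIES]
M = region_lfp(step, pairs)
```

Each round adds at least one new region tuple or stops, so the number of rounds is bounded by the number of region tuples, in line with the polynomial bound stated above.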

We now show that the logic is expressive enough to capture all Ptime-queries
on linear constraint databases.

Definition 9 A linear constraint database $\mathfrak{B}$ has the small coordinate property
if the absolute values of the coordinates of all points contained in a 0-dimensional
region are bounded by […], where n is the number of regions in the region
extension of $\mathfrak{B}$.
Theorem 10 RegLFP(conv) captures Ptime on the class of linear constraint
databases having the small coordinate property.

Sketch. As an arrangement of a semi-linear set can be computed in polynomial
time, the region extension of the input database can be computed in Ptime
as well. Now the Ptime data-complexity of RegLFP(conv) can be shown by
induction on the structure of the queries. To prove that each Ptime-query can
be defined by a RegLFP(conv)-query, we show that the run of a Turing-machine
M computing the query can be simulated. The crucial point is that, given the
input relation $S \subseteq \mathbb{R}^d$, a finite set $R \subseteq \mathbb{R}^{d(d+1)}$ of tuples of points such that

$$S = \bigcup \{\mathrm{conv}(\bar a_1, \dots, \bar a_{d+1}) : (\bar a_1, \dots, \bar a_{d+1}) \in R\}$$

can be defined in RegLFP(conv).
Further, given such a set $R \subseteq \mathbb{R}^{l(l+1)}$ for some $l \in \mathbb{N}$, the set
$\{\bar x : \bar x \in \mathrm{conv}(\bar a_1, \dots, \bar a_{l+1}) \text{ for some } (\bar a_1, \dots, \bar a_{l+1}) \in R\}$
can be defined using the conv
and uconv operators. This can be used to encode the input of a Turing-machine
and to decode its output. The run of the Turing-machine can be simulated as
usual in finite model theory, using region variables to denote positions on the
Turing-tape. The restriction to databases with small coordinates comes from the
fact that, in order to simulate the Turing-machine, coordinates of points have to
be encoded by (tuples of) region variables. Thus, only polynomially many bits
can be used to represent a coordinate of a point, restricting it to exponential
size. □



4.2 Finitary Fixed-Point Logic 

The query language introduced in the previous section depends on a specific 
decomposition of the input database. Thus, its usability relies on the existence 
of a decomposition which can easily be understood by the user. Although this 
is the case in some application areas, it will be a problem in others. In this 
section we present a way to overcome this dependency on a specific, intuitive 
decomposition. 

Below we define finitary fixed-point logic as the extension of FO(conv) by a 
least fixed-point operator over arbitrary definable finite sets. The idea is that, 
as mentioned in Section 3, the language FO(conv) is capable of defining a finite 
representation of the input database. Using this one can replace the fixed-point 
induction on the regions by a fixed-point induction on the finite representation. 

We already mentioned that finiteness of a semi-linear set is first-order
definable. Therefore we use formulae finite($\varphi$), which, given a formula $\varphi$, evaluate to
true if the set defined by $\varphi$ is finite and false otherwise.

Definition 11 Finitary Fixed-Point Logic (FFP) is defined as the extension of
FO(conv) by a finitary LFP operator. Precisely, if $\bar x := x_1, \dots, x_k$ and $\bar z :=
z_1, \dots, z_l$ are sequences of first-order variables, R is a $(k + l)$-ary second-order
variable, $\varphi(\bar x)$ and $\psi(R, \bar x, \bar z)$ are formulae, such that R occurs only positively in
$\psi$ and does not occur in $\varphi$, then also $[\mathrm{LFP}_{R, \bar x, \bar z}(\varphi, \psi)](\bar u, \bar v)$ is a formula with free
variables $\{\bar u, \bar v\}$, where $\bar u, \bar v$ are sequences of variables of arity k and l.
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The semantics of a formula x ^/>)](m, u) is defined as the least 

fixed-point of the function 

/x : Potc(K'=+') — ^ Pow{R'^+^) 

R I — {{x, z) : {3y (x, y) G R A (x, z) G i?) V 

{{^3y (x, y) G i?) A IB 1= finite((^) A ip{x) A -ipiR, x, z)))}, 

where IB is the input database. Intuitively, the formula (p serves as a guard, 
ensuring that the fixed-point induction runs over the finite set defined by (p 
only. As before, the variables z can be used to attach some information to a 
tuple X contained in a induction stage. 
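
The guarded induction can be pictured on a finite set as a plain iteration from the empty set. The following Python sketch is only an illustration under simplifying assumptions: the guard is given explicitly as a finite Python set (standing in for the set defined by the guard formula), no extra information is attached to the tuples, and the names `least_fixed_point` and `step` are hypothetical, not taken from the paper.

```python
def least_fixed_point(guard, step):
    """Iterate an operator to its least fixed point, restricted to a finite guard set.

    guard: finite set of candidates (stands in for the set defined by the guard formula).
    step:  function mapping the current stage (a set) to the elements it derives.
    """
    stage = set()
    while True:
        # Keep the current stage and add derived elements that lie inside the guard.
        new_stage = stage | {x for x in step(stage) if x in guard}
        if new_stage == stage:
            return stage
        stage = new_stage


# Example: points reachable from 0 along a finite edge relation.
edges = {(0, 1), (1, 2), (3, 4)}
guard = {0, 1, 2, 3, 4}


def step(stage):
    # 0 is always derived; b is derived once some a with an edge (a, b) is in the stage.
    return {0} | {b for (a, b) in edges if a in stage}


reachable = least_fixed_point(guard, step)
print(sorted(reachable))
```

Because every stage is a subset of the finite guard and the stages only grow, the loop stops after at most as many rounds as the guard has elements.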

Regarding the expressive power of this language, one can easily show that the 
languages RegLFP(conv) and FFP are equivalent. Thus, FFP captures Ptime 
on the class of linear constraint databases. 

Theorem 12 (i) FFP captures Ptime on the class of linear constraint databases
having the small coordinate property.

(ii) RegLFP(conv) and FFP have the same expressive power on the class of
linear constraint databases.



5 Polynomial Constraint Databases 

In this section we extend the approach taken in Section 4 to polynomial con- 
straint databases. Clearly, it does not make sense to add a convex closure opera- 
tor to first-order logic, as convex closure is already definable in FO on polynomial 
constraint databases. Thus, the logic PolyLFP is defined as the least fixed-point 
logic over region extensions, but the regions will now be defined by a cylindrical 
algebraic decomposition of the input space. 

In Section 5.1 we give a (very brief) overview of cylindrical algebraic de- 
compositions (CADs). See [Col75] or the monograph [CJ98] for details. We then 
define the region extension of polynomial constraint databases and introduce the 
logic PolyLFP. 

The data-complexity and expressive power of the logic is considered thereaf- 
ter. When dealing with complexity issues for constraint databases, one can base 
the examination on the Turing-model and restrict oneself to representing for- 
mulae with rational or real algebraic coefficients. Another approach is to ignore 
the complexity of the storage and manipulation of real numbers and use a com- 
putation model capable of storing arbitrary real numbers. These models have 
built-in functions for operations like multiplication and addition which can be 
executed in one time step. This approach puts more focus on the complexity of 
the underlying logic than on the complexity of manipulating numbers. There- 
fore we take this approach here and base our analysis of the complexity on the 
Blum-Shub-Smale (BSS) model. A brief introduction to BSS-machines can be 
found in Appendix A.

5.1 Cylindrical Algebraic Decomposition

Fix a dimension $d$. A region is defined as a connected subset of $\mathbb{R}^d$. The cylinder
$Z(R)$ over a region $R$ is defined as $R \times \mathbb{R}$. If $R$ is a region and $f : R \to \mathbb{R}$ is
a continuous function, then the $f$-section of $Z(R)$ is defined as the graph of $f$,
that is, the set $\{(\bar{a}, b) : \bar{a} \in R \text{ and } b = f(\bar{a})\}$.




Fig. 1. Illustration of some concepts used in CADs. 



Now, let $f_1, f_2 : R \to \mathbb{R}$ be continuous functions over $R$ such that for all
$\bar{a} \in R$, $f_1(\bar{a}) < f_2(\bar{a})$. The $(f_1, f_2)$-sector of $Z(R)$ is defined as the set
$\{(\bar{a}, b) : \bar{a} \in R \text{ and } f_1(\bar{a}) < b < f_2(\bar{a})\}$. We allow $f_1, f_2$ to be the constant functions
$-\infty, \infty$.

Let $A \subseteq \mathbb{R}^d$ be a set. A decomposition of $A$ is a finite collection of disjoint
regions whose union is $A$. Let $R$ be a region and let $f_1, \dots, f_k : R \to \mathbb{R}$ be a
sequence of continuous functions from $R$ to $\mathbb{R}$, such that $f_i(x) < f_{i+1}(x)$ for
all $x \in R$. This sequence defines a decomposition of the cylinder $Z(R)$ into
the regions defined by 1) the $f_i$-sections, 2) the $(f_i, f_{i+1})$-sectors where
$i \in \{0, \dots, k\}$ and $f_0(x) = -\infty$ and $f_{k+1}(x) = \infty$. Such a decomposition is
called a stack over $R$ (determined by $f_1, \dots, f_k$). See Figure 1 for an illustration
of these concepts.

A decomposition $D$ of $\mathbb{R}^d$ is called cylindrical if i) $d = 1$ and $D$ is a stack
over $\mathbb{R}^0$, i.e. a finite number of singletons and intervals, or ii) $d \geq 2$ and there is
a cylindrical decomposition $D'$ of $\mathbb{R}^{d-1}$ such that for each region $R \in D'$ there
is a subset $S \subseteq D$ forming a stack over $R$. Clearly, a cylindrical decomposition
of $\mathbb{R}^d$ determines a unique cylindrical decomposition of $\mathbb{R}^{d-1}$, called the induced
decomposition.

A decomposition of $\mathbb{R}^d$ is called algebraic if every region is a semi-algebraic
set. A cylindrical algebraic decomposition (CAD) of $\mathbb{R}^d$ is a decomposition of $\mathbb{R}^d$
which is both cylindrical and algebraic.

Let $A := \{f_1, \dots, f_m\}$ be a set of polynomials from $\mathbb{R}^d$ to $\mathbb{R}$. A decomposition
$D$ of $\mathbb{R}^d$ is called $A$-invariant if for each $f_i \in A$ and all regions $R \in D$, either
for all $x \in R$ $f_i(x) > 0$, $f_i(x) = 0$, or $f_i(x) < 0$.

Let $\mathfrak{B} := (\mathbb{R}, <, +, \cdot, S)$ be a database, where $S$ is $d$-ary. A CAD $D$ of $\mathbb{R}^d$
is invariant for $\mathfrak{B}$ if it is invariant for the set of polynomials occurring in the
representation of $S$. Thus, for each region $R \in D$ either $R \cap S = \emptyset$ or $R \subseteq S$.
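
For $d = 1$ these notions are easy to make concrete. The Python sketch below is an illustration only, under two assumptions that are not from the paper: the real roots of the polynomials are supplied by hand rather than computed, and the helper names `stack_cells`, `sample`, and `sign` are hypothetical. It builds the stack of singletons and open intervals determined by the roots and records the sign of every polynomial at one sample point per cell; since all roots are cell boundaries, that sign holds on the whole cell.

```python
import math


def stack_cells(points):
    """Cells of the stack determined by the given points: open intervals and singletons."""
    pts = sorted(set(points))
    bounds = [-math.inf] + pts + [math.inf]
    cells = []
    for i in range(len(bounds) - 1):
        cells.append(("interval", bounds[i], bounds[i + 1]))
        if i + 1 < len(bounds) - 1:
            cells.append(("point", bounds[i + 1], bounds[i + 1]))
    return cells


def sample(cell):
    """Pick one point inside a cell."""
    kind, lo, hi = cell
    if kind == "point":
        return lo
    if lo == -math.inf and hi == math.inf:
        return 0.0
    if lo == -math.inf:
        return hi - 1.0
    if hi == math.inf:
        return lo + 1.0
    return (lo + hi) / 2.0


def sign(value):
    return (value > 0) - (value < 0)


# Two polynomials, x^2 - 1 and x, with their real roots given explicitly (assumption).
polys = [lambda x: x * x - 1.0, lambda x: x]
roots = [-1.0, 0.0, 1.0]

cells = stack_cells(roots)
signs = [tuple(sign(p(sample(c))) for p in polys) for c in cells]
for cell, sign_vector in zip(cells, signs):
    print(cell, sign_vector)
```

With $k$ distinct roots the stack has $2k + 1$ cells, here seven, each carrying a constant sign vector.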

It is known that a CAD of $\mathbb{R}^d$ invariant for a given set $A$ of polynomials
can be computed in double exponential time in the dimension $d$ and polynomial
time in the size of $A$. Since the dimension is fixed, the algorithm operates in
polynomial time in the size of $A$, resp. in the size of a database $\mathfrak{B}$, if $A$ is the
set of polynomials occurring in the representation of $\mathfrak{B}$.

Further, a close analysis of the algorithms shows that if $D$ is a CAD of $\mathbb{R}^d$
invariant for a given set of polynomials of degree at most $n$, then, for $d$ fixed,
the degree of the polynomials defining the regions in $D$ is polynomially bounded
in $n$. See [Col75] for details.



5.2 A Fixed-Point Logic for Polynomial Constraint Databases 

We define a fixed-point query language for polynomial constraint databases. In 
analogy to Section 4 this language is based on region extensions of databases. 

Definition 13 Let $\mathfrak{B} := ((\mathbb{R}, <, +, \cdot), S)$ be a polynomial constraint database.
The region extension of $\mathfrak{B}$ is defined as a two-sorted structure
$\mathfrak{B}^{\mathrm{Reg}} := ((\mathbb{R}, <, +, \cdot), S; \mathrm{Reg})$. The first sort consists of the reals with order, addition,
and multiplication, whereas the second sort consists of the set Reg of regions
decomposing $\mathbb{R}^d$ in a CAD invariant for $\mathfrak{B}$.

We now define the language PolyLFP as least fixed-point logic on region 
extensions. As before, the fixed-point induction is defined on the (finite) set 
of regions only. Since every database has a unique region extension we don’t 
distinguish between databases and their region extensions and freely speak about 
a database being a model of a PolyLFP formula instead of explicitly mentioning 
its region extension. 

Definition 14 The logic PolyLFP is defined as the extension of first-order logic 
by the following rules. As in Definition 6 there are element, region, and relation 
variables. 

— If $R$ is a region variable and $\bar{x}$ is a sequence of element variables, then $R\bar{x}$
is a formula.

— If $\varphi$ is a formula and $R$ a region variable, then $\exists R \varphi$ is also a formula.

— If $M$ is a relation variable of arity $(k, l)$, $\bar{R} := R_1, \dots, R_k$ are region variables,
and $\bar{x} := x_1, \dots, x_l$ are element variables, then $M\bar{R}\bar{x}$ is a formula.

— If $\varphi(M, \bar{R}, \bar{x})$ is a formula with free element variables $\bar{x}$, free region variables
$\bar{R} := R_1, \dots, R_k$, and a free $(k, l)$-ary relation variable $M$, such that $\varphi$ is
positive in $M$, then $[\mathrm{LFP}_{M,\bar{R},\bar{x}}\, \varphi](\bar{R}, \bar{x})$ is a formula.

PolyLFP-queries are defined by PolyLFP-formulae without free region or relation
variables.

The semantics of the logic is defined analogously to RegLFP(conv), with the
region variables ranging over the set of regions in the region extension of a data-
base and relation variables being interpreted as subsets of $\mathrm{Reg}^k \times \mathbb{R}^l$. Since the re-
gion extension of a database can be computed in polynomial time and first-order
logic on polynomial constraint databases has polynomial time data-complexity,
the polynomial time data-complexity of PolyLFP follows immediately.

We now show that $P_\mathbb{R}$, the set of queries computable in polynomial time on a
BSS-machine, can be captured by PolyLFP for a restricted class of polynomial
constraint databases.

Definition 15 A polynomial constraint database $\mathfrak{B}$ is said to be a $k$-degree
database if the highest degree of any variable of a polynomial in the representation
of the database is at most $k$.

Note that, as mentioned in Section 5.1, by bounding the degree of the poly- 
nomials in the input database, we also get a polynomial bound on the degree of 
the polynomials bounding the regions. 

Theorem 16 For each $k \in \mathbb{N}$, PolyLFP captures $P_\mathbb{R}$ on the class of $k$-degree
databases.

Sketch. Again, the theorem is proved by showing that the run of a BSS-machine
computing a $P_\mathbb{R}$-query can be simulated, the crucial point being the represen-
tation of the input database. Recall from above that the regions are either
$f$-sections or $(f_1, f_2)$-sectors, for some polynomial functions $f, f_1, f_2$. We define
a representation of the database by a disjunction of formulae defining the re-
gions contained in it. Let region $R$ be an $f$-section. Since, for some $k \in \mathbb{N}$, the
input is restricted to be a $k$-degree database and thus the degree of $f$ is bounded
by $k$, the polynomial $f$ can be defined as $f := \sum_{i \in \{0, \dots, k\}^d} a_i x^i$, where, for
$i := (i_1, \dots, i_d) \in \{0, \dots, k\}^d$, $x^i$ denotes the product $\prod_{j=1}^{d} x_j^{i_j}$. The coefficients
$a_i$ in the sum can be defined by a formula
$\varphi(\bar{a}) := \forall \bar{x}\, \exists y\, (\sum_{i \in \{0, \dots, k\}^d} a_i x^i = y \wedge R\bar{x}y)$,
if the region $R$ contains more than one point. The case where $R$ contains
only one point is trivial. The sectors can be defined similarly, since they are 
bounded by one or two sections. These sections are again regions and can thus 
be defined as described above. □ 

6 Conclusion 

We introduced logics extending first-order logic by recursion mechanisms on 
linear as well as polynomial constraint databases. For linear constraint databases 
we also introduced a logic extending FO by the ability to compute convex closure 
and showed that regarding expressive power this concept equals the logic PFOL, 
where multiplication with one factor bounded by a finite set is permitted. 

We showed that the fixed-point logics offer query languages with rather high 
expressive power, while still having tractable data-complexity. For practical im- 
plementations of these query languages, arrangements or cylindrical algebraic
decompositions probably offer not the most intuitive definition of the region do-
main. But note that the results are independent of the precise definition of the 
region decomposition, as long as the database can be represented by a union 
of regions and the region decomposition can be computed in polynomial time. 
Thus, in a spatial database system, where spatial information is combined with 
non-spatial information giving a meaning to part of the spatial image, one could 
use a decomposition consistent with this semantical part.

A Blum-Shub-Smale Machines

Blum-Shub-Smale (BSS) machines were introduced by Blum, Shub, and
Smale in 1989 [BSS89]. We give a brief review of the computation model here.
For a detailed introduction see [BSS89] and the monograph on real computation
[BCSS98]. Note that our presentation of BSS machines differs slightly from the
one given by Blum, Shub, and Smale.

Intuitively, a BSS machine is a random access machine with real registers 
and built-in operations for addition, subtraction, multiplication, and division. A 
precise definition is given below. 

Definition 17 We define $\mathbb{R}^\infty$ as the set of all infinite sequences $(a_0, a_1, \dots)$ of
real numbers such that there is $k$ with $a_i = 0$ for all $i > k$.



Definition 18 A BSS machine consists of the input space $\mathbb{R}^\infty$, the state space
$S := \mathbb{N} \times \mathbb{N} \times \mathbb{N} \times \mathbb{R}^\infty$, the output space $\mathbb{R}^\infty$, and a directed connected graph
$G := (\{0, \dots, N\}, V)$ for some $N \in \mathbb{N}$. Each vertex has at most two children and
one of the following five types of operations assigned to it. The graph represents
the program of the machine, where vertices can be thought of as commands and
successors of nodes as the possible successive commands.

— The node 0 is the input node. An input $(y_1, y_2, \dots, y_k, 0, \dots) \in \mathbb{R}^\infty$ is mapped
to $(1, 1, 1, y_1, y_2, \dots) \in S$. Thus the node 0 has the successor 1.

— $N$ is the output node. The machine stops in the position $(N, i, j, x_1, x_2, \dots)$
and outputs $(x_1, x_2, \dots)$. The node $N$ has no successors.

— Computation nodes: An operation of this type transforms a state
$(n, i, j, x_1, x_2, \dots) \in S$ to $(n+1, i', j', g_n(x_1, x_2, \dots)) \in S$, where $n+1$ is the
successive command, $i' \in \{i+1, 1\}$, $j' \in \{j+1, 1\}$, and $g_n$ is one of the following
basic operations:

    — Two registers $x_r, x_s$ are added, subtracted, multiplied, or divided and the
    result is stored in $x_r$.

    — A register $x_r$ is set to a real constant.

The registers $x_i$ with $i \notin \{r, s\}$ are not altered.

— Test nodes: A test node $n$ has exactly two successors $n+1$ and $\beta(n)$. On a
state $(n, i, j, x_1, \dots) \in S$ the machine tests whether $x_1 > 0$ and continues in
node $n+1$ if the answer is yes and in node $\beta(n)$ otherwise.

— Copy nodes: In state $(n, i, j, x_1, \dots)$, the machine sets $x_j := x_i$ and continues
at node $n+1$.
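
To make the node types concrete, here is a small Python sketch of an interpreter for a simplified machine of this kind. It is an illustration under stated assumptions, not the definition above: the address registers $i, j$ and copy nodes are omitted, Python floats stand in for real numbers, and the instruction encoding and the name `run_bss` are hypothetical.

```python
import operator
from collections import defaultdict

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def run_bss(program, inputs, max_steps=10_000):
    """Run a simplified BSS-style program and return the final registers.

    program maps node numbers 1..N to instruction tuples (hypothetical encoding):
      ("op", symbol, r, s)  x_r := x_r <symbol> x_s, continue at n + 1
      ("const", r, c)       x_r := c, continue at n + 1
      ("test", beta)        continue at n + 1 if x_1 > 0, otherwise at node beta
      ("output",)           the node with the largest number N; the machine stops
    """
    x = defaultdict(float)                  # registers x_1, x_2, ...; unset ones are 0
    for k, value in enumerate(inputs, start=1):
        x[k] = float(value)
    last = max(program)                     # output node N
    n = 1                                   # the input node 0 has successor 1
    for _ in range(max_steps):
        if n == last:
            return dict(x)
        instr = program[n]
        if instr[0] == "op":
            _, symbol, r, s = instr
            x[r] = OPS[symbol](x[r], x[s])
            n += 1
        elif instr[0] == "const":
            _, r, c = instr
            x[r] = float(c)
            n += 1
        elif instr[0] == "test":
            n = n + 1 if x[1] > 0 else instr[1]
        else:
            raise ValueError(f"unknown instruction {instr!r}")
    raise RuntimeError("step limit exceeded")


# Add 1 to x_1 until it becomes positive; the test node branches back to node 2.
program = {
    1: ("const", 2, 1.0),
    2: ("op", "+", 1, 2),
    3: ("test", 2),
    4: ("output",),
}
result = run_bss(program, [-2.5])
print(result[1])
```

Each arithmetic operation counts as one step regardless of the size of the numbers involved, which is the point of the model; the number of steps of this example program still depends on the magnitude of the input.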

Based on BSS machines one can define complexity classes like $P_\mathbb{R}$ as the class
of all problems computable on a BSS machine in polynomially many steps. See
[BCSS98] for details.

Abstract. This document proposes an algebra for XML Query. The
algebra has been submitted to the W3C XML Query Working Group. A
novel feature of the algebra is the use of regular-expression types, similar
in power to DTDs or XML Schemas, and closely related to Hosoya and
Pierce’s work on XDuce. The iteration construct is based on the notion
of a monad, and involves novel typing rules not encountered elsewhere.



1 Introduction 

This document proposes an algebra for XML Query. 

This work builds on long-standing traditions in the database community. In
particular, we have been inspired by systems such as SQL, OQL, and nested 
relational algebra (NRA). We have also been inspired by systems such as Quilt, 
UnQL, XDuce, XML-QL, XPath, XQL, and YATL. We give citations for all 
these systems below. 

In the database world, it is common to translate a query language into an 
algebra; this happens in SQL, OQL, and NRA, among others. The purpose of 
the algebra is twofold. First, the algebra is used to give a semantics for the query 
language, so the operations of the algebra should be well-defined. Second, the 
algebra is used to support query optimization, so the algebra should possess a 
rich set of laws. Our algebra is powerful enough to capture the semantics of 
many XML query languages, and the laws we give include analogues of most of 
the laws of relational algebra. 

In the database world, it is common for a query language to exploit schemas 
or types; this happens in SQL, OQL, and NRA, among others. The purpose of 
types is twofold. Types can be used to detect certain kinds of errors at compile 
time and to support query optimization. DTDs and XML Schema can be thought 
of as providing something like types for XML. Our algebra uses a simple type 
system that captures the essence of XML Schema [42] . The type system is close 
to that used in XDuce. Our type system can detect common type errors and 
support optimization. A novel aspect of the type system (not found in XDuce)
is the description of projection in terms of iteration, and the typing rules for
iteration that make this viable. The algebra is based on earlier work on the use
of monads to query semi-structured data, and the iteration construct satisfies the
three monad laws.



J. Van den Bussche and V. Vianu (Eds.): ICDT 2001, LNCS 1973, pp. 263—300, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001

This paper describes the key features of the algebra. For simplicity, we restrict
our attention to only three scalar types (strings, integers, and booleans), but we 
believe the system will smoothly extend to cover the continuum of scalar types 
found in XML Schema. Other important features that we do not tackle include 
attributes, namespaces, element identity, collation, and key constraints, among 
others. Again, we believe they can be added within the framework given here. 

Two earlier versions of this paper have been distributed. The first [18] used 
a more ad-hoc approach to typing. The second [19] used a different notation for 
case analysis, and has less discussion of the relation to earlier work on monads. 

The paper is organized as follows. A tutorial introduction is presented in 
Section 2. Section 3 explains key aspects of how the algebra treats projection 
and iteration. The expressions of the algebra are summarized in Section 4. The 
type system is reviewed in Section 5. Some laws of the algebra are presented 
in Section 6. Finally, the static typing rules for the algebra are described in 
Section 7. Section 8 discusses open issues and problems. 

Cited literature includes: SQL [16], OQL [4,5,13], NRA [15,8,24,25], com-
prehensions [33,6], monads [33,34,35], Quilt [11], UnQL [3], XDuce [21], XML
Query [40,41], XML Schema [42,43], XML-QL [17], XPath [39,36], XQL [30],
and YATL [14].

2 The Algebra by Example 

This section introduces the main features of the algebra, using familiar examples 
based on accessing a database of books. 

2.1 Data and Types 

Consider the following sample data:

    <bib>
      <book>
        <title>Data on the Web</title>
        <year>1999</year>
        <author>Abiteboul</author>
        <author>Buneman</author>
        <author>Suciu</author>
      </book>
      <book>
        <title>XML Query</title>
        <year>2001</year>
        <author>Fernandez</author>
        <author>Suciu</author>
      </book>
    </bib>



Here is a fragment of an XML Schema for such data.

    <xsd:group name="Bib">
      <xsd:element name="bib">
        <xsd:complexType>
          <xsd:group ref="Book"
                     minOccurs="0" maxOccurs="unbounded"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:group>

    <xsd:group name="Book">
      <xsd:element name="book">
        <xsd:complexType>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="year" type="xsd:integer"/>
          <xsd:element name="author" type="xsd:string"
                       minOccurs="1" maxOccurs="unbounded"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:group>

This data and schema are represented in our algebra as follows:

    type Bib =
      bib [ Book* ]
    type Book =
      book [
        title [ String ],
        year [ Integer ],
        author [ String ]+
      ]

    let bibO : Bib =
      bib [
        book [
          title [ "Data on the Web" ],
          year [ 1999 ],
          author [ "Abiteboul" ],
          author [ "Buneman" ],
          author [ "Suciu" ]
        ],
        book [
          title [ "XML Query" ],
          year [ 2001 ],
          author [ "Fernandez" ],
          author [ "Suciu" ]
        ]
      ]

The expression above defines two types, Bib and Book, and defines one global
variable, bibO.

The Bib type consists of a bib element containing zero or more entries of 
type Book. The Book type consists of a book element containing a title element 
(which contains a string), a year element (which contains an integer), and one 
or more author elements (which contain strings). 

The Bib type corresponds to a single bib element, which contains a forest 
of zero or more Book elements. We use the term forest to refer to a sequence of 
(zero or more) elements. Every element can be viewed as a forest of length one. 

The Book type corresponds to a single book element, which contains one 
title element, followed by one year element, followed by one or more author 
elements. A title or author element contains a string value and a year element 
contains an integer. 

The variable bibO is bound to a literal XML value, which is the data model 
representation of the earlier XML document. The bib element contains two book 
elements. 

The algebra is a strongly typed language, so the value of bibO must
be an instance of its declared type, or the expression is ill-typed. Here the value
of bibO is an instance of the Bib type, because it contains one bib element,
which contains two book elements, each of which contains a string-valued title,
an integer-valued year, and one or more string-valued author elements.

For convenience, we define a second global variable bookO, also bound to a 
literal value, which is equivalent to the first book in bibO. 

let bookO : Book = 
book [ 

title [ "Data on the Web" ] , 
year [ 1999 ] , 
author [ "Abiteboul" ] , 
author [ "Buneman" ] , 
author [ "Suciu" ] 

] 

2.2 Projection 

The simplest operation is projection. The algebra uses a notation similar in 
appearance and meaning to path navigation in XPath. 

The following expression returns all author elements contained in bookO: 

    bookO/author
    ==> author [ "Abiteboul" ],
        author [ "Buneman" ],
        author [ "Suciu" ]
    :   author [ String ]+

The above example and the ones that follow have three parts. First is an expres-
sion in the algebra. Second, following the ==>, is the value of this expression.
Third, following the :, is the type of the expression, which is (of course) also a
legal type for the value.

The following expression returns all author elements contained in book ele- 
ments contained in bibO: 

bibO/book/author 
==> author [ "Abiteboul" ] , 
author [ "Buneman" ] , 
author [ "Suciu" ] , 
author [ "Fernandez" ] , 
author [ "Suciu" ] 

: author [ String ] * 

Note that in the result, the document order of author elements is preserved and 
that duplicate elements are also preserved. 
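
The behaviour of projection on values can be mimicked in a few lines of Python. This is only a sketch under assumptions that are not part of the algebra: elements are modelled as `(tag, content)` pairs, a forest as a Python list, and the helper names `elem`, `project`, and `data` are hypothetical. Types are not modelled at all.

```python
def elem(tag, *content):
    """Build an element as a (tag, content) pair; content mixes child elements and scalars."""
    return (tag, list(content))


def project(forest, tag):
    """forest/tag: child elements with the given tag, in document order, duplicates kept."""
    return [child
            for _, content in forest
            for child in content
            if isinstance(child, tuple) and child[0] == tag]


def data(forest):
    """forest/data(): the scalar values directly inside the elements of the forest."""
    return [item
            for _, content in forest
            for item in content
            if not isinstance(item, tuple)]


# The sample bibliography from Section 2.1.
bib = elem("bib",
           elem("book", elem("title", "Data on the Web"), elem("year", 1999),
                elem("author", "Abiteboul"), elem("author", "Buneman"),
                elem("author", "Suciu")),
           elem("book", elem("title", "XML Query"), elem("year", 2001),
                elem("author", "Fernandez"), elem("author", "Suciu")))

authors = project(project([bib], "book"), "author")
names = data(authors)
print(names)
```

"Suciu" appears twice in the result, once per book, and the names come out in document order.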

It may be unclear why the type of bibO/book/author contains zero or more 
authors, even though the type of a book element contains one or more authors. 
Let’s look at the derivation of the result type by looking at the type of each 
sub-expression: 

bibO : Bib 

bibO/book : Book* 

bibO/book/author : author [ String ] * 

Recall that Bib, the type of bibO, may contain zero or more Book elements,
therefore the expression bibO/book might contain zero book elements, in which
case, bibO/book/author would contain no authors.

This illustrates an important feature of the type system: the type of an 
expression depends only on the type of its sub-expressions. It also illustrates 
the difference between an expression’s run-time value and its compile-time type. 
Since the type of bibO is Bib, the best type for bibO/book/author is one listing 
zero or more authors, even though for the given value of bibO the expression 
will always contain exactly five authors. 

One may access scalar data (strings, integers, or booleans) using the keyword
data(). For instance, if we wish to select all author names in a book, rather than
all author elements, we could write the following.

    bookO/author/data()
    ==> "Abiteboul",
        "Buneman",
        "Suciu"
    :   String*

Similarly, the following returns the year the book was published.

    bookO/year/data()
    ==> 1999
    :   Integer

This notation is similar to the use of text() in XPath. We chose the keyword
data() because, as the second example shows, not all data items are strings.

2.3 Iteration



Another common operation is to iterate over elements in a document so that 
their content can be transformed into new content. Here is an example of how 
to process each book to list the authors before the title, and remove the year. 

    for b in bibO/book do
      book [ b/author, b/title ]
    ==> book [
          author [ "Abiteboul" ],
          author [ "Buneman" ],
          author [ "Suciu" ],
          title [ "Data on the Web" ]
        ],
        book [
          author [ "Fernandez" ],
          author [ "Suciu" ],
          title [ "XML Query" ]
        ]
    :   book [
          author [ String ]+,
          title [ String ]
        ]*



The for expression iterates over all book elements in bibO and binds the varia- 
ble b to each such element. For each element bound to b, the inner expression 
constructs a new book element containing the book’s authors followed by its 
title. The transformed elements appear in the same order as they occur in bibO. 

In the result type, a book element is guaranteed to contain one or more 
authors followed by one title. Let’s look at the derivation of the result type to 
see why: 



    bibO/book : Book*
    b         : Book
    b/author  : author [ String ]+
    b/title   : title [ String ]



The type system can determine that b is always Book, therefore the type of 
b/author is author [ String ] + and the type of b/title is title [ String ] . 

In general, the value of a for loop is a forest. If the body of the loop itself 
yields a forest, then all of the forests are concatenated together. For instance, 
the expression: 



for b in bibO/book do 
b/author 



is exactly equivalent to the expression bibO/book/author. 
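
On values, the for loop behaves like a "map and concatenate" over a forest. The Python sketch below illustrates this under the same kind of assumptions as before: elements are `(tag, content)` pairs, forests are lists, and `elem`, `project`, and `for_each` are hypothetical helper names. The typing rules, which are the subtle part, are not modelled.

```python
def elem(tag, *content):
    """Build an element as a (tag, content) pair."""
    return (tag, list(content))


def project(forest, tag):
    """forest/tag: child elements with the given tag, in document order."""
    return [c for _, content in forest for c in content
            if isinstance(c, tuple) and c[0] == tag]


def for_each(forest, body):
    """for x in forest do body(x): apply body to each element, concatenate the results."""
    result = []
    for x in forest:
        result.extend(body([x]))    # each element is passed on as a forest of length one
    return result


bib = elem("bib",
           elem("book", elem("title", "Data on the Web"),
                elem("author", "Abiteboul"), elem("author", "Buneman")),
           elem("book", elem("title", "XML Query"), elem("author", "Fernandez")))

books = project([bib], "book")

# The loop over books that returns b/author agrees with the path books/author.
via_loop = for_each(books, lambda b: project(b, "author"))
via_path = project(books, "author")

# Rebuild each book with its authors listed before its title.
reordered = for_each(
    books,
    lambda b: [elem("book", *project(b, "author"), *project(b, "title"))])
print(reordered)
```

Since the body returns a forest for every element and the results are concatenated in order, the transformed books appear in the same order as in the input.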

Here we have explained the typing of for loops by example. In fact, the
typing rules are rather subtle and are one of the more interesting aspects of the
algebra; they will be explained further below.

2.4 Selection

Projection and for loops can serve as the basis for many interesting queries. The 
next three sections show how they provide the power for selection, quantification, 
join, and regrouping. 

To select values that satisfy some predicate, we use the where expression. 
For example, the following expression selects all book elements in bibO that 
were published before 2000. 

    for b in bibO/book do
      where b/year/data() < 2000 do
        b

==> book [ 

title [ "Data on the Web" ] , 
year [ 1999 ] , 
author [ "Abiteboul" ] , 
author [ "Buneman" ] , 
author [ "Suciu" ] 

] 

: Book* 

An expression of the form

    where e1 do e2

is just syntactic sugar for

    if e1 then e2 else ()

where e1 and e2 are expressions. Here () is an expression that stands for the
empty sequence, a forest that contains no elements. We also write () for the
type of the empty sequence.

According to this rule, the expression above translates to

    for b in bibO/book do
      if b/year/data() < 2000 then b else ()

and this has the same value and the same type as the preceding expression. 
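
The desugaring of where can be imitated directly in Python. This is a minimal sketch, assuming books are flattened into dictionaries and forests into lists; the function name `where` mirrors the algebra's keyword but is otherwise a hypothetical helper.

```python
books = [
    {"title": "Data on the Web", "year": 1999},
    {"title": "XML Query", "year": 2001},
]


def where(condition, forest):
    """where e1 do e2 is read as: if e1 then e2 else ()."""
    return forest if condition else []


# for b in books do where b/year < 2000 do b
selected = []
for b in books:
    selected.extend(where(b["year"] < 2000, [b]))

titles = [b["title"] for b in selected]
print(titles)
```

A book that fails the test contributes the empty forest, which disappears when the per-book results are concatenated.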

2.5 Quantification 

The following expression selects all book elements in bibO that have some author 
named “Buneman” . 

    for b in bibO/book do
      for a in b/author do
        where a/data() = "Buneman" do
          b
    ==> book [
          title [ "Data on the Web" ],
          year [ 1999 ],
          author [ "Abiteboul" ],
          author [ "Buneman" ],
          author [ "Suciu" ]
        ]
    :   Book*



In contrast, we can use the empty operator to find all books that have no 
author whose name is Buneman: 



    for b in bibO/book do
      where empty(for a in b/author do
                    where a/data() = "Buneman" do
                      a) do
        b
    ==> book [
          title [ "XML Query" ],
          year [ 2001 ],
          author [ "Fernandez" ],
          author [ "Suciu" ]
        ]
    :   Book*

The empty expression checks that its argument is the empty sequence (). 

We can also use the empty operator to find all books where all the authors 
are Buneman, by checking that there are no authors that are not Buneman: 



    for b in bibO/book do
      where empty(for a in b/author do
                    where a/data() <> "Buneman" do
                      a) do
        b
    ==> ()
    :   Book*



There are no such books, so the result is the empty sequence. Appropriate use 
of empty (possibly combined with not) can express universally or existentially 
quantified expressions. 
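
The three quantifier patterns can be compared side by side in a short Python sketch. It assumes books flattened into dictionaries with a list of author names; `empty` mirrors the algebra's operator, while `authors_where` is a hypothetical helper.

```python
books = [
    {"title": "Data on the Web", "authors": ["Abiteboul", "Buneman", "Suciu"]},
    {"title": "XML Query", "authors": ["Fernandez", "Suciu"]},
]


def empty(forest):
    """True exactly when the forest is the empty sequence."""
    return len(forest) == 0


def authors_where(book, predicate):
    """for a in book/author do where predicate(a) do a"""
    return [a for a in book["authors"] if predicate(a)]


def is_buneman(name):
    return name == "Buneman"


def is_not_buneman(name):
    return name != "Buneman"


# Some author is Buneman: the selection is not empty.
some = [b["title"] for b in books if not empty(authors_where(b, is_buneman))]
# No author is Buneman: the selection is empty.
none = [b["title"] for b in books if empty(authors_where(b, is_buneman))]
# Every author is Buneman: no author fails the test.
every = [b["title"] for b in books if empty(authors_where(b, is_not_buneman))]
print(some, none, every)
```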

Here is a good place to introduce the let expression, which binds a local 
variable to a value. Introducing local variables may improve readability. For 
example, the following expression is exactly equivalent to the previous one. 

    for b in bibO/book do
      let nonbunemans = (for a in b/author do
                           where a/data() <> "Buneman" do
                             a) do
        where empty(nonbunemans) do
          b

Local variables can also be used to avoid repetition when the same subexpression
appears more than once in a query.

Later we will introduce match expressions, and we will see how to define 
empty using match in Section 6. 

2.6 Join 

Another common operation is to join values from one or more documents. To 
illustrate joins, we give a second data source that defines book reviews: 

type Reviews = 
reviews [ 
book [ 

title [ String ] , 
review [ String ] 

]* 

] 

let reviewO : Reviews = 
reviews [ 
book [ 

title [ "XML Query" ] , 
review [ "A darn fine book." ] 

], 

book [ 

title [ "Data on the Web" ] , 
review [ "This is great ! " ] 

] 

] 

The Reviews type contains one reviews element, which contains zero or more 
book elements; each book contains a title and review. 

We can use nested for loops to join the two sources reviewO and bibO on 
title values. The result combines the title, authors, and reviews for each book. 

for b in bibO/book do 

for r in reviewO/book do 

where b/title/data() = r/title/data() do 
book [ b/title, b/author, r/review ] 

    ==> book [
          title [ "Data on the Web" ],
          author [ "Abiteboul" ],
          author [ "Buneman" ],
          author [ "Suciu" ],
          review [ "This is great!" ]
        ],




272



M. Fernandez, J. Simeon, and P. Wadler



        book [
          title [ "XML Query" ],
          author [ "Fernandez" ],
          author [ "Suciu" ],
          review [ "A darn fine book." ]
        ]
    :   book [
          title [ String ],
          author [ String ]+,
          review [ String ]
        ]*

Note that the outermost for expression determines the order of the result.
Readers familiar with optimization of relational join queries know that relational 
joins commute, i.e., they can be evaluated in any order. This is not true for the 
XML algebra: changing the order of the first two for expressions would pro- 
duce different output. In Section 8, we discuss extending the algebra to support 
unordered forests, which would permit commutable joins. 
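
The dependence on loop order is easy to observe on lists. The following Python sketch assumes books and reviews flattened into dictionaries, and `join` is a hypothetical helper; it runs the same nested-loop join with either source in the outer loop.

```python
books = [
    {"title": "Data on the Web", "authors": ["Abiteboul", "Buneman", "Suciu"]},
    {"title": "XML Query", "authors": ["Fernandez", "Suciu"]},
]
reviews = [
    {"title": "XML Query", "review": "A darn fine book."},
    {"title": "Data on the Web", "review": "This is great!"},
]


def join(books_outermost):
    """Nested-loop join on title; the flag chooses which source drives the outer loop."""
    if books_outermost:
        pairs = [(b, r) for b in books for r in reviews]
    else:
        pairs = [(b, r) for r in reviews for b in books]
    return [(b["title"], r["review"]) for b, r in pairs if b["title"] == r["title"]]


books_outer = join(True)
reviews_outer = join(False)
print(books_outer)
print(reviews_outer)
```

Both results contain the same pairs, but in a different order, so the two loop nests are equal as sets and different as ordered forests.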

2.7 Restructuring 

Often it is useful to regroup elements in an XML document. For example, each 
book element in bibO groups one title with multiple authors. This expression 
regroups each author with the titles of his/her publications. 

    for a in distinct(bibO/book/author) do
      biblio [
        a,
        for b in bibO/book do
          for a2 in b/author do
            where a/data() = a2/data() do
              b/title
      ]
    ==> biblio [
          author [ "Abiteboul" ],
          title [ "Data on the Web" ]
        ],
        biblio [
          author [ "Buneman" ],
          title [ "Data on the Web" ]
        ],
        biblio [
          author [ "Suciu" ],
          title [ "Data on the Web" ],
          title [ "XML Query" ]
        ],
        biblio [
          author [ "Fernandez" ],
          title [ "XML Query" ]
        ]
    :   biblio [
          author [ String ],
          title [ String ]*
        ]*

Readers may recognize this expression as a self-join of books on authors. The 
expression distinct (bibO/book/author) produces a forest of author elements 
with no duplicates. The outer for expression binds a to each author element, 
and the inner for expression selects the title of each book that has some author 
equal to a. 

Here distinct is an example of a built-in function. It takes a forest of ele- 
ments and removes duplicates. 

The type of the result expression may seem surprising: each biblio element 
may contain zero or more title elements, even though in bibO, every author 
co-occurs with a title. Recognizing such a constraint is outside the scope of 
the type system, so the resulting type is not as precise as we might like. 
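
The regrouping can be traced step by step in Python. This sketch assumes books flattened into dictionaries; `distinct` is written here as an order-preserving duplicate filter, which is an assumption about the built-in function's behaviour that happens to reproduce the result shown above.

```python
books = [
    {"title": "Data on the Web", "authors": ["Abiteboul", "Buneman", "Suciu"]},
    {"title": "XML Query", "authors": ["Fernandez", "Suciu"]},
]


def distinct(items):
    """Remove duplicates, keeping the first occurrence of each item (assumed behaviour)."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


all_authors = [a for b in books for a in b["authors"]]

# For each distinct author, collect the titles of the books listing that author.
biblio = [
    (a, [b["title"] for b in books for a2 in b["authors"] if a == a2])
    for a in distinct(all_authors)
]
for author, titles in biblio:
    print(author, titles)
```

Every title list here is non-empty, but nothing in the shape of the query forces that, which matches the `title [ String ]*` in the inferred type.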



2.8 Aggregation 

The algebra has five built-in aggregation functions: avg, count, max, min and 
sum. This expression selects books that have more than two authors: 

for b in bibO/book do 

where count (b/author) > 2 do 
b 

==> book [ 

title [ "Data on the Web" ] , 
year [ 1999 ] , 
author [ "Abiteboul" ] , 
author [ "Buneman" ] , 
author [ "Suciu" ] 

] 

: Book* 

All the aggregation functions take a forest with repetition type and return an 
integer value; count returns the number of elements in the forest. 



2.9 Functions 

Functions can make queries more modular and concise. Recall that we used the 
following query to find all books that do not have “Buneman” as an author. 

for b in bibO/book do 

where empty (for a in b/author do 

where a/data() = "Buneman" do 
a) do 
b 

==> book [ 

title [ "XML Query" ] , 

year [ 2001 ] , 

author [ "Fernandez" ] , 

author [ "Suciu" ] 

] 



A different way to formulate this query is to first define a function that takes a 
string s and a book b as arguments, and returns true if book b does not have 
an author with name s. 

fun notauthor (s : String; b : Book) : Boolean = 
empty (for a in b/author do 

where a/data () = s do 
a) 



The query can then be re-expressed as follows. 



for b in bibO/book do 

where notauthor ("Buneman" ; b) do 
b 

==> book [ 

title [ "XML Query" ] , 

year [ 2001 ] , 

author [ "Fernandez" ] , 

author [ "Suciu" ] 

] 

: Book* 



We use semicolon rather than comma to separate function arguments, since 
comma is used to concatenate forests. 

Note that a function declaration includes the types of all its arguments and 
the type of its result. This is necessary for the type system to guarantee that 
applications of functions are type correct. 

In general, any number of functions may be declared at the top-level. The 
order of function declarations does not matter, and each function may refer to 
any other function. Among other things, this allows functions to be recursive 
(or mutually recursive), which supports structural recursion, the subject of the 
next section. 

2.10 Structural Recursion 

XML documents can be recursive in structure; for example, it is possible to define 
a part element that directly or indirectly contains other part elements. In the 
algebra, we use recursive types to define documents with a recursive structure, 
and we use recursive functions to process such documents. (We can also use 
mutual recursion for more complex recursive structures.) 

For instance, here is a recursive type defining a part hierarchy. 

type Part = 

Basic | Composite 
type Basic = 
basic [ 

cost [ Integer ] 

] 

type Composite = 
composite [ 

assembly_cost [ Integer ] , 
subparts [ Part+ ] 

] 

And here is some sample data. 

let partO : Part =
  composite [
    assembly_cost [ 12 ],
    subparts [
      composite [
        assembly_cost [ 22 ],
        subparts [
          basic [ cost [ 33 ] ]
        ]
      ],
      basic [ cost [ 7 ] ]
    ]
  ]

Here vertical bar ( | ) is used to indicate a choice between types: each part is either 
basic (no subparts), and has a cost, or is composite, and includes an assembly 
cost and subparts. 

We might want to translate to a second form, where every part has a total 
cost and a list of subparts (for a basic part, the list of subparts is empty). 

type Part2 =
  part [
    total_cost [ Integer ],
    subparts [ Part2* ]
  ]

Here is a recursive function that performs the desired transformation. It uses 
a new construct, the match expression. 

fun convert (p : Part) : Part2 = 
match p 

case b : Basic do 
part [ 

total_cost[ b/cost/data() ], 
subparts [] 

] 

case c : Composite do 

let s = (for q in children(c/subparts) do convert (q)) in 
part [ 

total_cost [ 

c/assembly_cost/data() + sum(s/total_cost/data() ) 

], 

subparts [ s ] 

] 

else error 

Each branch of the match expression is labeled with a type, Basic or Composite, 
and with a corresponding variable, b or c. The evaluator checks the type of the 
value of p at run-time, and evaluates the corresponding branch. If the first branch 
is taken then b is bound to the value of p, and the branch returns a new part 
with total cost the same as the cost of b, and with no subparts. If the second 
branch is taken then c is bound to the value of p. The function is recursively 
applied to each of the subparts of c, giving a list of new subparts s. The branch 
returns a new part with total cost computed by adding the assembly cost of c 
to the sum of the total cost of each subpart in s, and with subparts s. 

One might wonder why b and c are required, since they have the same value 
as p. The reason why is that p, b, and c have different types. 

p : Part 
b : Basic 
c : Composite 

The types of b and c are more precise than the type of p, because which branch 
is taken depends upon the type of value in p. 

Applying the query to the given data gives the following result. 

convert (partO)

==> part [
      total_cost [ 74 ],
      subparts [
        part [
          total_cost [ 55 ],
          subparts [
            part [
              total_cost [ 33 ],
              subparts []
            ]
          ]
        ],
        part [
          total_cost [ 7 ],
          subparts []
        ]
      ]
    ]
  : Part2

Of course, a match expression may be used in any query, not just in a recur- 
sive one. 

2.11 Processing Any Well-Formed Document 

Recursive types allow us to define a type that matches any well-formed XML 
document. This type is called UrTree: 

type UrTree = UrScalar | ~[UrType] 
type UrType = UrTree* 

Here UrScalar is a built-in scalar type. It stands for the most general scalar 
type, and all other scalar types (like Integer or String) are subtypes of it. The 
tilde (~) is used to indicate a wild-card type. In general, ~ [t] indicates the type 
of elements that may have any element name, but must have children of type 
t. So an UrTree is either an UrScalar or a wildcard element with zero or more 
children, each of which is itself an UrTree. In other words, any single element or 
scalar has type UrTree. 

Types analogous to UrType and UrScalar appear in XML Schema. The use 
of tilde is a significant extension to XML Schema, because XML Schema has no 
type corresponding to ~[t], where t is some type other than UrType. It is not 
clear that this extension is necessary, since the more restrictive expressiveness 
of XML Schema wildcards may be adequate. 

In particular, our earlier data also has type UrTree. 

bookO : UrTree
==> book [
      title [ "Data on the Web" ],
      year [ 1999 ],
      author [ "Abiteboul" ],
      author [ "Buneman" ],
      author [ "Suciu" ]
    ]
  : UrTree

A specific type can be indicated for any expression in the query language, by 
writing a colon and the type after the expression. 

As an example, we define a recursive function that converts any XML data 
into HTML. We first give a simplified definition of HTML. 

type HTML =
  ( UrScalar
  | b [ HTML ]
  | ul [ (li [ HTML ])* ]
  )*

An HTML body consists of a sequence of zero or more items, each of which is 
either: a scalar; or a b element (boldface) with HTML content; or a ul element 
(unordered list), where the children are li elements (list item), each of which 
has HTML content. 

Now, here is the function that performs the conversion. 

fun html_of_xml ( t : UrTree ) : HTML = 
match t 

case s : UrScalar do 
s 

case e : ~ [UrType] do 
b [ name (e) ] , 

ul [ for c in children(e) do li [ html_of_xml(c) ] ] 
else error 

The match expression checks whether the value of t is data or an element, and 
evaluates the corresponding branch. If the first branch is taken, then s is bound 
to the value of t, which must be a scalar, and the branch returns the scalar. If 
the second branch is taken, then e is bound to the value of t, which must be an 
element. The branch returns the name of the element in boldface, followed by a 
list containing one item for each child of the element. The function is recursively 
applied to get the content of each list item. 

Applying the query to the book element above gives the following result. 

html_of_xml(bookO)

==> b [ "book" ],
    ul [
      li [ b [ "title" ],  ul [ li [ "Data on the Web" ] ] ],
      li [ b [ "year" ],   ul [ li [ 1999 ] ] ],
      li [ b [ "author" ], ul [ li [ "Abiteboul" ] ] ],
      li [ b [ "author" ], ul [ li [ "Buneman" ] ] ],
      li [ b [ "author" ], ul [ li [ "Suciu" ] ] ]
    ]
  : Html_Body
2.12 Top-Level Queries 

A query consists of a sequence of top-level expressions, or query items, where each 
query item is either a type declaration, a function declaration, a global variable 
declaration, or a query expression. The order of query items is immaterial; all 
type, function, and global variable declarations may be mutually recursive. 

A query can be evaluated by the query interpreter. Each query expression 
is evaluated in the environment specified by all of the declarations. (Typically, 
all of the declarations will precede all of the query expressions, but this is not 
required.) We have already seen examples of type, function, and global variable 
declarations. An example of a query expression is: 

query html_of_xml(bookO) 

To transform any expression into a top-level query, we simply precede the ex- 
pression by the query keyword. 

3 Projection and Iteration 

This section describes key aspects of projection and iteration. 

3.1 Relating Projection to Iteration 

The previous examples use the / operator liberally, but in fact we use / as 
a convenient abbreviation for expressions built from lower-level operators: for 
expressions, the children function, and match expressions. 

For example, the expression: 
bookO/author 

is equivalent to the expression: 

for c in children(bookO) do
  match c
    case a : author[UrType] do a
    else ()

Here the children function returns a forest consisting of the children of the 
element bookO, namely, a title element, a year element, and three author elements 
(the order is preserved). The for expression binds the variable c successively to 
each of these elements. Then the match expression selects a branch based on the 
value of c. If it is an author element then the first branch is evaluated, otherwise 
the second branch. If the first branch is evaluated, the variable a is bound to 
the same value as c, and the branch returns the value of a. The type of a is 
author[String], which is the intersection of the type of c and the type 
author[UrType]. If the second branch is evaluated, then the branch returns 
(), the empty sequence. 

To compose several expressions using /, we again use for expressions. For 
example, the expression: 



bibO/book/author 







is equivalent to the expression: 

for c in children(bibO) do
  match c
    case b : book[UrType] do
      for d in children(b) do
        match d
          case a : author[UrType] do a
          else ()
    else ()

The for expression iterates over all book elements in bibO and binds the variable 
b to each such element. For each element bound to b, the inner expression returns 
all the author elements in b, and the resulting forests are concatenated together 
in order. 

In general, an expression of the form e / a is converted to the form 

for v1 in e do
  for v2 in children(v1) do
    match v2
      case v3 : a[UrType] do v3
      else ()

where e is an expression, a is an element name, and v1, v2, and v3 are fresh 
variables (ones that do not appear in the expression being converted). 
According to this rule, the expression bibO/book translates to 

for v1 in bibO do
  for v2 in children(v1) do
    match v2
      case v3 : book[UrType] do v3
      else ()

In Section 3.3 we discuss laws which allow us to simplify this to the previous 
expression 

for v2 in children(bibO) do
  match v2
    case v3 : book[UrType] do v3
    else ()

Similarly, the expression bibO/book/author translates to 

for v4 in (for v2 in children(bibO) do
             match v2
               case v3 : book[UrType] do v3
               else ()) do
  for v5 in children(v4) do
    match v5
      case v6 : author[UrType] do v6
      else ()

Again, the laws will allow us to simplify this to the previous expression 

for v2 in children(bibO) do
  match v2
    case v3 : book[UrType] do
      for v5 in children(v3) do
        match v5
          case v6 : author[UrType] do v6
          else ()
    else ()

These examples illustrate an important feature of the algebra: high-level opera- 
tors may be defined in terms of low-level operators, and the low-level operators 
may be subject to algebraic laws that can be used to further simplify the ex- 
pression. 

3.2 Typing Iteration 

The typing of for loops is rather subtle. We give an intuitive explanation here, 
and cover the detailed typing rules in Section 7. 

A unit type is either an element type a[t], a wildcard type ~[t], or a scalar 
type s. A for loop 

for v in e1 do e2 

is typed as follows. First, one finds the type of expression e1. Next, for each unit 
type in this type one assumes the variable v has the unit type and one types 
the body e2. Note that this means we may type the body e2 several times, 
once for each unit type in the type of e1. Finally, the types of the body e2 are 
combined, according to how the types were combined in e1. That is, if the type 
of e1 is formed with sequencing, then sequencing is used to combine the types 
of e2, and similarly for choice or repetition. 

For example, consider the following expression, which selects all author ele- 
ments from a book. 

for c in children(bookO) do
  match c
    case a : author do a
    else ()

The type of children (bookO) is 

title [String] , year [Integer] , author [String] + 

This is composed of three unit types, and so the body is typed three times. 

assuming c has type title [String]   the body has type ()
assuming c has type year [Integer]   the body has type ()
assuming c has type author [String]  the body has type author [String]

The three result types are then combined in the same way the original unit types 
were, using sequencing and iteration. This yields 

(), (), author [String]+ 

as the type of the iteration, and simplifying yields 

author [String]+ 

as the final type. 

As a second example, consider the following expression, which selects all 
title and author elements from a book, and renames them. 

for c in children (bookO) do 
match c 

case t : title [String] do titl [ t/data() ] 

case y : year [Integer] do () 

case a : author [String] do auth [ a/data() ] 

else error 

Again, the type of children (bookO) is 

title [String] , year [Integer] , author [String] + 

This is composed of three unit types, and so the body is typed three times. 

assuming c has type title [String]   the body has type titl [String]
assuming c has type year [Integer]   the body has type ()
assuming c has type author [String]  the body has type auth [String]

The three result types are then combined in the same way the original unit types 
were, using sequencing and iteration. This yields 

titl [String] , () , auth [String] + 

as the type of the iteration, and simplifying yields 

titl [String] , auth [String] + 

as the final type. Note that the title occurs just once and the author occurs one 
or more times, as one would expect. 

As a third example, consider the following expression, which selects all basic 
parts from a sequence of parts. 

for p in children (partO/subparts) do 
match p 

case b : Basic do b 

case c : Composite do () 
else error 

The type of children (partO/subparts) is 

(Basic | Composite)+ 

This is composed of two unit types, and so the body is typed two times. 

assuming p has type Basic      the body has type Basic
assuming p has type Composite  the body has type ()

The two result types are then combined in the same way the original unit types 
were, using sequencing and iteration. This yields 

(Basic | ())+ 

as the type of the iteration, and simplifying yields 
Basic* 

as the final type. Note that although the original type involves repetition one 
or more times, the final result is a repetition zero or more times. This is what 
one would expect, since if all the parts are composite the final result will be an 
empty sequence. 

In this way, we see that for loops can be combined with match expressions 
to select and rename elements from a sequence, and that the result is given a 
sensible type. 
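
The recipe of typing the body once per unit type and recombining the results can be sketched in Python. Everything here is an assumption made for illustration: types are encoded as tagged tuples, t+ is expanded to t , t*, and simplify implements only the few identities needed for the examples above, not the full set of type equations.

```python
EMPTY = ("empty",)                       # the empty sequence type ()

def plus(t):
    """t+ abbreviates t , t*."""
    return ("seq", t, ("star", t))

def type_for(t, body_type):
    """Type of `for v in e do body`, given the type t of e.

    body_type maps a unit type to the type of the body under that assumption;
    the results are recombined the same way the unit types were combined in t.
    """
    kind = t[0]
    if kind == "unit":
        return body_type(t)
    if kind == "empty":
        return EMPTY
    if kind in ("seq", "choice"):
        return (kind, type_for(t[1], body_type), type_for(t[2], body_type))
    if kind == "star":
        return ("star", type_for(t[1], body_type))
    raise ValueError(kind)

def simplify(t):
    """Apply a handful of sound identities bottom-up (deliberately incomplete)."""
    kind = t[0]
    if kind in ("unit", "empty"):
        return t
    if kind == "star":
        inner = simplify(t[1])
        if inner[0] == "choice" and inner[2] == EMPTY:   # (t | ())* = t*
            inner = inner[1]
        return ("star", inner)
    left, right = simplify(t[1]), simplify(t[2])
    if kind == "seq":
        if left == EMPTY:                                # () , t = t
            return right
        if right == EMPTY:                               # t , () = t
            return left
        if (left[0] == "choice" and left[2] == EMPTY
                and right == ("star", left[1])):         # (t | ()) , t* = t*
            return right
        return ("seq", left, right)
    return ("choice", left, right)

# Third example: select the basic parts from (Basic | Composite)+.
BASIC, COMPOSITE = ("unit", "Basic"), ("unit", "Composite")
parts = plus(("choice", BASIC, COMPOSITE))
basic_only = simplify(type_for(parts, lambda u: u if u == BASIC else EMPTY))
```

Here basic_only comes out as the encoding of Basic*, mirroring the simplification of (Basic | ())+ described above.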

In order for this approach to typing to be sensible, it is necessary that the unit 
types can be uniquely identified. However, the type system given here satisfies 
the following law. 

a[t1 | t2] = a[t1] | a[t2] 

This has one unit type on the left, but two distinct unit types on the right, 
and so might cause trouble. Fortunately, our type system inherits an additio- 
nal restriction from XML Schema: we insist that the regular expressions can 
be recognized by a top-down deterministic automaton. In that case, the regular 
expression must have the form on the left; the form on the right is outlawed 
because it requires a non-deterministic recognizer. With this additional restriction, 
there is no problem. 

The method of translating projection to iteration described in the previous 
section, combined with the typing rules given here, yields optimal types for 
projections, in the following sense. Say that variable x has type t, and the projection 
x / a has type t'. The type assignment is sound if for every value of type t, the 
value of x / a has type t'. The type assignment is complete if for every value y 
of type t' there is a value x of type t such that x / a = y. In symbols, we can see 
that these conditions are complementary. 

sound: $\forall x \in t.\ \exists y \in t'.\ x / a = y$ 

complete: $\forall y \in t'.\ \exists x \in t.\ x / a = y$ 

Any sensible type system must be sound, but it is rare for a type system to be 
complete. But, remarkably, the type assignment given by the above approach is 
both sound and complete. 

3.3 Monad Laws 

Investigating aspects of homological algebra in the 1950s, category theorists un- 
covered the concept of a monad, which among other things generalizes set, bag, 
and list types. Investigating programming languages based on lists in the 1970s, 
functional programmers adapted from set theory the notion of a comprehension, 
which expresses iteration over set, bag, and list types. In the early 1990s, a pre- 
cise connection between monads and comprehension notation was uncovered by 
Wadler [33,34,35], who was inspired by Moggi’s work applying monads to de- 
scribe features of programming languages [27,28]. As the decade progressed this 
was applied by researchers at the University of Pennsylvania to database langu- 
ages for semi-structured data [6], particularly nested relational algebra (NRA) 
[8,25,24] and the Kleisli system [37]. 

The iteration construct of the algebra corresponds to the structure of a 
monad. The correspondence is close, but not exact. Each monad is based on a unary 
type constructor, such as Set(t) or List(t), representing a homogeneous set or list 
where all elements are of type t. In contrast, here we have more complex and 
heterogeneous types, such as a forest consisting of a title, a year, and a sequence 
of one or more authors. Also, one important component of a monad is the unit 
operator, which converts an element to a set or list. If x has type t, then {x} is 
a unit set of type Set(t) or [x] is a unit list of type List(t). In contrast, here we 
simply write, say, author ["Buneman"], which stands for both a tree and for the 
unit forest containing that tree. 

One can define comprehensions in terms of iteration: 

[ e0 | x1 <- e1 ] = for x1 in e1 do e0 

[ e0 | x1 <- e1 , x2 <- e2 ] = for x1 in e1 do for x2 in e2 do e0 

Conversely, one can define iterations in terms of comprehension: 

for x in e1 do e2 = [ y | x <- e1 , y <- e2 ] 

Here y is a fresh variable name. 

Monads satisfy three laws, and three corresponding laws are satisfied by the 
iteration notation given here. 

First, iteration over a unit forest can be replaced by substitution. This is called 
the left unit law. 

for v in e1 do e2 = e2{v := e1} 

provided that e1 is a unit type (e.g., is an element or a scalar constant). We 
write e1{v := e2} to denote the result of taking expression e1 and replacing 
occurrences of the variable v by the expression e2. For example 

for v in author ["Buneman"] do auth [v/data()] = auth ["Buneman"] 

Second, an iteration that returns the iteration variable is equivalent to the 
identity. This is called the right unit law. 



for v in e do v = e 

For example 

for v in bookO do v = bookO 

An important feature of the type system described here is that the left side of 
the above equation always has the same type as the right side. (This was not 
true for an earlier version of the type system [18].) 

Third, there are two ways of writing an iteration over an iteration, both of 
which are equivalent. This is called the associative law. 

for v2 in (for v1 in e1 do e2) do e3 
= for v1 in e1 do (for v2 in e2 do e3) 

For example, a projection over a forest includes an implicit iteration, so e / a = 
for v in e do v / a. Say we define a forest of bibliographies, bib1 = bibO , 
bibO. Then bib1/book/author is equivalent to the first expression below, which 
in turn is equivalent to the second. 

for b in (for a in bib1 do a/book) do b/author 
= for a in bib1 do (for b in a/book do b/author) 

With nested relational algebra, the monad laws play a key role in optimizing 
queries. For instance, they are exploited extensively in the Kleisli system for 
biomedical data, developed by Limsoon Wong and others at the University of 
Pennsylvania and Kent Ridge Digital Labs, and now sold commercially [37]. 
Similarly, the monad laws can also be exploited for optimization in this context. 

For example, if b is a book, the following finds all authors of the book that 
are not Buneman: 

for a in b do
  when a/data() != "Buneman" do
    a

If l is a list of authors, the following renames all author elements to auth 
elements: 

for a' in l do
  auth [ a'/data() ]

Combining these, we select all authors that are not Buneman, and rename the 
elements: 

for a' in (for a in b do
             when a/data() != "Buneman" do
               a) do
  auth [ a'/data() ]

Applying the associative law for a monad, we get: 

for a in b do
  for a' in (when a/data() != "Buneman" do a) do
    auth [ a'/data() ]

Expanding the when clause to a conditional, we get: 

for a in b do
  for a' in (if a/data() != "Buneman" then a else ()) do
    auth [ a'/data() ]

Applying a standard law for loops over conditionals gives: 

for a in b do
  if a/data() != "Buneman" then
    for a' in a do
      auth [ a'/data() ]
  else ()

Applying the left unit law for a monad, we get: 

for a in b do
  if a/data() != "Buneman" then
    auth [ a/data() ]
  else ()

And replacing the conditional by a when clause, we get: 

for a in b do
  when a/data() != "Buneman" do
    auth [ a/data() ]

Thus, simple manipulations, including the monad laws, fuse the two loops. 

Section 3.1 ended with two examples of simplification. Returning to these, 
we can now see that the simplifications are achieved by application of the left 
unit and associative monad laws. 
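
The three laws, and the loop fusion they justify, can be checked on ordinary Python lists, which form a monad with the same shape of iteration. This is an informal sketch: forests are modelled as lists, for_ and unit are names introduced here, and the author data is taken from the running example.

```python
def for_(forest, body):
    """for v in forest do body(v): concatenate the result forests in order."""
    return [r for v in forest for r in body(v)]

def unit(x):
    """The unit forest containing just x."""
    return [x]

authors = [("author", ["Abiteboul"]), ("author", ["Buneman"]),
           ("author", ["Suciu"])]

def keep(a):
    """when a/data() != "Buneman" do a"""
    return [a] if a[1] != ["Buneman"] else []

def rename(a):
    """auth [ a/data() ]"""
    return [("auth", a[1])]

def two_loops(forest):
    """Filter in an inner loop, then rename in an outer loop."""
    return for_(for_(forest, keep), rename)

def fused(forest):
    """The single loop obtained by applying the laws as in the derivation above."""
    return for_(forest, lambda a: rename(a) if a[1] != ["Buneman"] else [])
```

The left unit law reads for_(unit(x), f) == f(x), the right unit law reads for_(e, unit) == e, and the associative law reads for_(for_(e, f), g) == for_(e, lambda v: for_(f(v), g)); two_loops and fused agree on every input for the same reason.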

4 Expressions 

Figure 1 contains the grammar for the algebra, i.e., the convenient concrete 
syntax in which a user may write a query. A few of these expressions can be 
rewritten as other expressions in a smaller core algebra; such derived expressions 
are labeled with *. We define the algebra’s typing rules on the smaller core 
algebra. In Section 6, we give the laws that relate a user expression with its 
equivalent expression in the core algebra. Typing rules for the core algebra are 
defined in Section 7. 

We have seen examples of most of the expressions, so we will only point out 
a few details here. We include only two operators, + and =, and one aggregate 
function sum in the formal syntax; adding others is straightforward. 

A query consists of a sequence of query items, where each query item is either 
a type declaration, a function declaration, a global variable declaration, or a 
query expression. The order of query items is immaterial; all type, function, and 
global variable declarations may be mutually recursive. Each query expression 
tag          a
function     f
variable     v
integer      c_int  ::= ... | -1 | 0 | 1 | ...
string       c_str  ::= "" | "a" | "b" | ...
boolean      c_bool ::= false | true
constant     c      ::= c_int | c_str | c_bool

expression   e ::= c                         scalar constant
                |  v                         variable
                |  a[e]                      element
                |  ~e[e]                     computed element
                |  e , e                     sequence
                |  ()                        empty sequence
                |  if e then e else e        conditional
                |  let v = e do e            local binding
                |  for v in e do e           iteration
                |  match e0                  match
                     case v1 : t1 do e1
                     ...
                     case vn : tn do en
                     else en+1
                |  f(e;...;e)                function application
                |  error ()                  error
                |  e + e                     plus
                |  e = e                     equal
                |  sum(e)                    aggregation
                |  children(e)               children
                |  name(e)                   element name
                |  e / a                     element projection *
                |  e/data()                  scalar projection *
                |  where e then e            conditional *
                |  empty(e)                  empty test *

query item   q ::= type x = t                type declaration
                |  fun f(v:t;...;v:t):t = e  function declaration
                |  let v : t = e             global declaration
                |  query e                   query expression

data         d ::= c                         scalar constant
                |  a[d]                      element
                |  d , d                     sequence
                |  ()                        empty sequence

Fig. 1. Expressions 
element name  a
type name     x

scalar type   s ::= Integer
                 |  String
                 |  Boolean
                 |  UrScalar

type          t ::= x            type name
                 |  s            scalar type
                 |  a[t]         element
                 |  ~[t]         wildcard
                 |  t , t        sequence
                 |  t | t        choice
                 |  t*           repetition
                 |  ()           empty sequence
                 |  ∅            empty choice

unit type     u ::= a[t]         element
                 |  ~[t]         wildcard
                 |  s            scalar type

Fig. 2. Types 



is evaluated in the environment specified by all of the declarations. (Typically, 
all of the declarations will precede all of the query expressions, but this is not 
required.) 

We define a subset of expressions that correspond to data values. An expres- 
sion is a data value if it consists only of scalar constant, element, sequence, and 
empty sequence expressions. 

5 Types 

Figure 2 contains the grammar for the algebra’s type system. We have already 
seen many examples of types. Here, we point out some details. 

Our algebra uses a simple type system that captures the essence of XML 
Schema [42]. The type system is close to that used in XDuce [21]. 

In the type system of Figure 2, a scalar type may be a UrScalar, Boolean, 
Integer, or String. In XML Schema, a scalar type is defined by one of fourteen 
primitive datatypes and a list of facets. A type hierarchy is induced between 
scalar types by containment of facets. The algebra’s type system can be genera- 
lized to support these types without much increase in its complexity. We added 
UrScalar, because XML Schema does not support a most general scalar type. 

A type is either: a type variable; a scalar type; an element type with element 
name a and content type t; a wildcard type with an unknown element name and 
content type t; a sequence of two types; a choice of two types; a repetition type; 
the empty sequence type; or the empty choice type. 

The algebra’s external type system, that is, the type definitions associated 
with input and output documents, is XML Schema. The internal types are in 
some ways more expressive than XML Schema; for example, XML Schema has no 
type corresponding to Integer* (which is required as the type of the argument 
to an aggregation operator like sum or min or max), or corresponding to ~[t] 
where t is some type other than UrTree*. In general, mapping XML Schema 
types into internal types will not lose information; however, mapping internal 
types into XML Schema may lose information. 

5.1 Relating Values to Types 

Recall that data is the subset of expressions that consists only of scalar constant, 
element, sequence, and empty sequence expressions. We write ⊢ d : t if data d 
has type t. The following type rules define this relation. 



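\[
\vdash c_{int} : \mathtt{Integer} \qquad
\vdash c_{str} : \mathtt{String} \qquad
\vdash c_{bool} : \mathtt{Boolean} \qquad
\vdash c : \mathtt{UrScalar}
\]

\[
\frac{\vdash d : t}{\vdash a[d] : a[t]} \qquad
\frac{\vdash d : t}{\vdash a[d] : {\sim}[t]}
\]

\[
\frac{\vdash d_1 : t_1 \qquad \vdash d_2 : t_2}{\vdash d_1 , d_2 : t_1 , t_2} \qquad
\frac{}{\vdash () : ()}
\]

\[
\frac{\vdash d : t_1}{\vdash d : (t_1 \mid t_2)} \qquad
\frac{\vdash d : t_2}{\vdash d : (t_1 \mid t_2)}
\]

\[
\frac{\vdash d_1 : t \qquad \vdash d_2 : t^*}{\vdash (d_1 , d_2) : t^*} \qquad
\frac{}{\vdash () : t^*}
\]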
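
Read operationally, these rules describe a membership test for forests against regular types. The Python sketch below is one naive way to implement that test by trying every split of the forest. It is an illustration under assumed encodings (types as tagged tuples, elements as (name, children) pairs, forests as lists); it handles no type names or recursive types and makes no attempt at efficiency.

```python
# Assumed type encoding:
#   ("scalar", str | int | bool | None)   None stands for UrScalar
#   ("elem", name, t)   ("wild", t)   ("seq", t1, t2)   ("choice", t1, t2)
#   ("star", t)   ("empty",) for ()   ("none",) for the empty choice
def has_type(d, t):
    """Decide whether the forest d (a list of trees) has type t."""
    kind = t[0]
    if kind == "empty":
        return d == []
    if kind == "none":
        return False
    if kind == "scalar":
        return (len(d) == 1 and not isinstance(d[0], tuple)
                and (t[1] is None or type(d[0]) is t[1]))
    if kind == "elem":
        return (len(d) == 1 and isinstance(d[0], tuple)
                and d[0][0] == t[1] and has_type(d[0][1], t[2]))
    if kind == "wild":
        return (len(d) == 1 and isinstance(d[0], tuple)
                and has_type(d[0][1], t[1]))
    if kind == "seq":
        return any(has_type(d[:i], t[1]) and has_type(d[i:], t[2])
                   for i in range(len(d) + 1))
    if kind == "choice":
        return has_type(d, t[1]) or has_type(d, t[2])
    if kind == "star":
        # () : t*, or a non-empty prefix of type t followed by a rest of type t*
        return d == [] or any(has_type(d[:i], t[1]) and has_type(d[i:], t)
                              for i in range(1, len(d) + 1))
    raise ValueError(kind)

STRING, INTEGER, UR_SCALAR = ("scalar", str), ("scalar", int), ("scalar", None)
AUTHOR = ("elem", "author", STRING)
BOOK = ("elem", "book",
        ("seq", ("elem", "title", STRING),
         ("seq", ("elem", "year", INTEGER),
          ("seq", AUTHOR, ("star", AUTHOR)))))      # author+ = author , author*

book0 = ("book", [("title", ["Data on the Web"]), ("year", [1999]),
                  ("author", ["Abiteboul"]), ("author", ["Buneman"]),
                  ("author", ["Suciu"])])
```

For instance, has_type([book0], BOOK) holds, while a book element with no author fails the test because author+ requires at least one author.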

We write t1 <: t2 if for every data d such that ⊢ d : t1 it is also the case 
that ⊢ d : t2, that is, t1 is a subtype of t2. It is easy to see that <: is a partial 
order, that is, it is reflexive, t <: t, and it is transitive, if t1 <: t2 and t2 <: t3 
then t1 <: t3. Here are some of the inequations that hold. 

∅ <: t
t <: UrType
t1 <: t1 | t2
t2 <: t1 | t2
Integer <: UrScalar
String <: UrScalar
Boolean <: UrScalar
a[t] <: ~[t]

Further, if t <: t' then 

a[t] <: a[t']
t* <: t'*

And if t1 <: t1' and t2 <: t2' then 

t1 , t2 <: t1' , t2'
t1 | t2 <: t1' | t2'

We write t1 = t2 if t1 <: t2 and t2 <: t1. Here are some of the equations that 
hold. 

UrScalar = Integer | String | Boolean
(t1 , t2) , t3 = t1 , (t2 , t3)
t , () = t
() , t = t
t1 | t2 = t2 | t1
(t1 | t2) | t3 = t1 | (t2 | t3)
t | ∅ = t
∅ | t = t
t1 , (t2 | t3) = (t1 , t2) | (t1 , t3)
(t1 | t2) , t3 = (t1 , t3) | (t2 , t3)
t , ∅ = ∅
∅ , t = ∅
t* = () | t , t*

We also have that t1 <: t2 if and only if t1 | t2 = t2. 

We define t? and t+ as abbreviations, by the following equivalences. 

t? = () | t 
t+ = t , t* 

We define the intersection t1 ∧ t2 of two types t1 and t2 to be the largest type 
t that is smaller than both t1 and t2. That is, t = t1 ∧ t2 if t <: t1 and t <: t2, 
and for any t' such that t' <: t1 and t' <: t2 we have t' <: t. 

6 Equivalences and Optimization 

6.1 Equivalences 

Here are the laws that define derived expressions (those labeled with * in Fi- 
gure 1) in terms of other expressions. 

e / a
  = for v1 in e do                                  (1)
      for v2 in children(v1) do
        match v2
          case v3 : a do v3
          else ()

e / data()
  = for v1 in e do                                  (2)
      for v2 in children(v1) do
        match v2
          case v3 : UrScalar do v3
          else ()

where e1 then e2
  = if e1 then e2 else ()                           (3)

empty(e)
  = match e case v : () do true else false          (4)

Law 1 rewrites the element projection expression e / a, as described previously. 
Law 2 rewrites the scalar projection expression e / data(), similarly. Law 3 
rewrites a where expression as a conditional, as described previously. Law 4 
rewrites an empty test using a match expression. 

6.2 Optimizations 

In a relational query engine, algebraic simplifications are often applied by a query 
optimizer before a physical execution plan is generated; algebraic simplification 
can often reduce the size of the intermediate results computed by a query in- 
terpreter. The purpose of our laws is similar - they eliminate unnecessary for 
or match expressions, or they enable other optimizations by reordering or dis- 
tributing computations. The set of laws given here is suggestive, rather than 
complete. 
Here are some simplification laws. 

for v in () do e = ()                                      (1)

for v in (e1 , e2) do e3
  = (for v in e1 do e3) , (for v in e2 do e3)              (2)

for v in e1 do e2
  = e2{v := e1}, if e1 : u                                 (3)

match a[e0] case v : a do e1 else e2
  = e1{v := a[e0]}                                         (4)

match a'[e0] case v : a do e1 else e2
  = e2, if a ≠ a'                                          (5)

for v in e do v = e                                        (6)

Laws 1, 2, and 3 simplify iterations. Law 1 rewrites an iteration over the 
empty sequence as the empty sequence. Law 2 distributes iteration through 
sequence: iterating over the sequence e1 , e2 is equivalent to the sequence of two 
iterations, one over e1 and one over e2. Law 3 is the left unit law for a monad. If 
e1 has a unit type, then e1 can be substituted for occurrences of v in e2. Laws 4 
and 5 eliminate trivial case expressions. Law 6 is the right unit law for a monad. 

The remaining laws commute expressions. Each law actually abbreviates a 
number of other laws, since the context variable E stands for a number of different 
expressions. The notation E[e] stands for one of the six expressions given with 
expression e replacing the hole [ ] that appears in each of the alternatives. 

E ::= if [] then e1 else e2
    | let v = [] do e
    | for v in [] do e
    | match []
        case v1 : t1 do e1
        ...
        case vn : tn do en
        else en+1

Here are the laws for commuting expressions. 

E[if e1 then e2 else e3]
    = if e1 then E[e2] else E[e3]                      (7)

E[let v = e1 do e2]
    = let v = e1 do E[e2]                              (8)

E[for v in e1 do e2]
    = for v in e1 do E[e2]                             (9)

E[match e0
    case v1 : t1 do e1
    ...
    case vn : tn do en
    else en+1]
    = match e0                                         (10)
        case v1 : t1 do E[e1]
        ...
        case vn : tn do E[en]
        else E[en+1]

Each law has the same form. Law 7 commutes conditionals, Law 8 commutes 
local bindings, Law 9 commutes iterations, and Law 10 commutes match expres- 
sions. For instance, one of the expansions of Law 9 is the following, when E is 
taken to be for v in [] do e. 

    for v2 in (for v1 in e1 do e2) do e3
    = for v1 in e1 do (for v2 in e2 do e3)

This will be recognized as the associative law for a monad. 
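
The three monad laws named in this section (Law 3, Law 6, and the expansion of Law 9) can be checked on a small model. The following is only a sketch and not part of the algebra: it assumes that sequences are modelled as Python lists, and the helper `for_each` is a hypothetical name standing in for `for v in e1 do e2`.

```python
def for_each(seq, body):
    """Model of `for v in seq do body(v)`: concatenate the per-item results."""
    out = []
    for v in seq:
        out.extend(body(v))
    return out

e = [1, 2, 3]
f = lambda v: [v, v * 10]   # an arbitrary body returning a sequence
g = lambda v: [v + 1]       # another arbitrary body

# Law 3 (left unit): iterating over a one-item sequence substitutes the item.
left_unit = for_each([7], f) == f(7)

# Law 6 (right unit): `for v in e do v` is e (v is wrapped as a singleton here).
right_unit = for_each(e, lambda v: [v]) == e

# Law 9 with E taken to be `for v2 in [] do e3` (associativity).
assoc = for_each(for_each(e, f), g) == for_each(e, lambda v1: for_each(f(v1), g))
```

In this list model all three flags evaluate to true for any choice of `e`, `f` and `g`, which is the usual statement that list concatenation forms a monad.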

7 Type Rules 

We explain our type system in the form commonly used in the programming 
languages community. For a textbook introduction to type systems, see, for 
example, Mitchell [26]. 

7.1 Environments 

The type rules make use of an environment that specifies the types of variables 
and functions. The type environment is denoted by $\Gamma$, and is composed of a 
comma-separated list of variable types, $v : t$, or function types, $f : (t_1; \ldots; t_n) \to t$. 
We retrieve type information from the environment by writing $(v : t) \in \Gamma$ to 
look up a variable, or by writing $(f : (t_1; \ldots; t_n) \to t) \in \Gamma$ to look up a function. 

7.2 Type Rules 

We write $\Gamma \vdash e : t$ if in environment $\Gamma$ the expression e has type t. Below are 
all the rules except those for for and match expressions, which are discussed in 
the following subsections. 



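$$\Gamma \vdash c_{\mathit{int}} : \text{Integer} \qquad \Gamma \vdash c_{\mathit{str}} : \text{String} \qquad \Gamma \vdash c_{\mathit{bool}} : \text{Boolean}$$

$$\frac{(v : t) \in \Gamma}{\Gamma \vdash v : t}$$

$$\frac{\Gamma \vdash e : t}{\Gamma \vdash a[e] : a[t]}$$

$$\frac{\Gamma \vdash e_1 : \text{String} \qquad \Gamma \vdash e_2 : t}{\Gamma \vdash {\sim}e_1[e_2] : {\sim}[t]}$$

$$\frac{\Gamma \vdash e_1 : t_1 \qquad \Gamma \vdash e_2 : t_2}{\Gamma \vdash e_1 \,,\, e_2 : t_1 \,,\, t_2}$$

$$\Gamma \vdash () : ()$$

$$\frac{\Gamma \vdash e_1 : \text{Boolean} \qquad \Gamma \vdash e_2 : t_2 \qquad \Gamma \vdash e_3 : t_3}{\Gamma \vdash \texttt{if}\ e_1\ \texttt{then}\ e_2\ \texttt{else}\ e_3 : (t_2 \mid t_3)}$$

$$\frac{\Gamma \vdash e_1 : t_1 \qquad \Gamma, v : t_1 \vdash e_2 : t_2}{\Gamma \vdash \texttt{let}\ v = e_1\ \texttt{do}\ e_2 : t_2}$$

$$\frac{(f : (t_1; \ldots; t_n) \to t) \in \Gamma \qquad \Gamma \vdash e_i : t_i' \qquad t_i' <: t_i}{\Gamma \vdash f(e_1; \ldots; e_n) : t}$$

$$\Gamma \vdash \texttt{error}() : \emptyset$$

$$\frac{\Gamma \vdash e_1 : \text{Integer} \qquad \Gamma \vdash e_2 : \text{Integer}}{\Gamma \vdash e_1 + e_2 : \text{Integer}}$$

$$\frac{\Gamma \vdash e_1 : t_1 \qquad \Gamma \vdash e_2 : t_2}{\Gamma \vdash e_1 = e_2 : \text{Boolean}}$$

$$\frac{\Gamma \vdash e : \text{Integer}^*}{\Gamma \vdash \texttt{sum}\ e : \text{Integer}}$$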
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7.3 Typing for Expressions 

The type rule for for expressions uses the following auxiliary judgement. We 
write $\Gamma \vdash \texttt{for}\ v : t\ \texttt{do}\ e : t'$ if in environment $\Gamma$ when the bound variable of an 
iteration v has type t then the body e of the iteration has type t'. 

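$$\frac{\Gamma, v : u \vdash e : t'}{\Gamma \vdash \texttt{for}\ v : u\ \texttt{do}\ e : t'}$$

$$\Gamma \vdash \texttt{for}\ v : ()\ \texttt{do}\ e : ()$$

$$\frac{\Gamma \vdash \texttt{for}\ v : t_1\ \texttt{do}\ e : t_1' \qquad \Gamma \vdash \texttt{for}\ v : t_2\ \texttt{do}\ e : t_2'}{\Gamma \vdash \texttt{for}\ v : t_1 \,,\, t_2\ \texttt{do}\ e : t_1' \,,\, t_2'}$$

$$\Gamma \vdash \texttt{for}\ v : \emptyset\ \texttt{do}\ e : \emptyset$$

$$\frac{\Gamma \vdash \texttt{for}\ v : t_1\ \texttt{do}\ e : t_1' \qquad \Gamma \vdash \texttt{for}\ v : t_2\ \texttt{do}\ e : t_2'}{\Gamma \vdash \texttt{for}\ v : t_1 \mid t_2\ \texttt{do}\ e : t_1' \mid t_2'}$$

$$\frac{\Gamma \vdash \texttt{for}\ v : t\ \texttt{do}\ e : t'}{\Gamma \vdash \texttt{for}\ v : t^*\ \texttt{do}\ e : t'^*}$$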

Given the above, the type rule for for expressions is immediate. 

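$$\frac{\Gamma \vdash e_1 : t_1 \qquad \Gamma \vdash \texttt{for}\ v : t_1\ \texttt{do}\ e_2 : t_2}{\Gamma \vdash \texttt{for}\ v\ \texttt{in}\ e_1\ \texttt{do}\ e_2 : t_2}$$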
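
The auxiliary judgement above is a structural recursion over the type of the bound variable. The sketch below illustrates that recursion only; the tuple encoding of types and the names `type_for` and `body_type` are assumptions made for this example and do not come from the paper.

```python
# Hypothetical encoding of types as tagged tuples:
#   ("unit", name)        a unit type u
#   ("empty",)            the empty sequence ()
#   ("none",)             the empty choice
#   ("seq", t1, t2)       t1 , t2
#   ("choice", t1, t2)    t1 | t2
#   ("star", t)           t*

def type_for(t, body_type):
    """Type of `for v : t do e`, where body_type(u) is the type of e when v : u."""
    tag = t[0]
    if tag == "unit":
        return body_type(t)            # check the body once per unit type
    if tag in ("empty", "none"):
        return t                       # () maps to (), empty choice to itself
    if tag in ("seq", "choice"):
        return (tag, type_for(t[1], body_type), type_for(t[2], body_type))
    if tag == "star":
        return ("star", type_for(t[1], body_type))
    raise ValueError(f"unknown type constructor: {tag}")

# Example body: yields an Integer for `a` elements and a String otherwise.
def body(u):
    return ("unit", "Integer") if u[1] == "a" else ("unit", "String")

result = type_for(("star", ("choice", ("unit", "a"), ("unit", "b"))), body)
```

Because the body is re-examined for each unit type occurring in `t`, it may be checked many times, which is the situation the next subsection relies on.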

7.4 Typing Match Expressions 

Due to the rule for iteration, it is possible that the body of an iteration is checked 
many times. Thus, when a match expression is checked, it is possible that quite 
a lot is known about the type of the expression being matched, and one can 
determine that only some of the clauses of the match apply. The definition of 
match uses the auxiliary judgments to check whether a given clause is applicable. 

We write $\Gamma \vdash \texttt{case}\ v : t\ \texttt{do}\ e : t'$ if in environment $\Gamma$ when the bound variable 
of the case v has type t then the body e of the case has type t'. Note the type 
of the body is irrelevant if $t = \emptyset$. 

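$$\frac{t \neq \emptyset \qquad \Gamma, v : t \vdash e : t'}{\Gamma \vdash \texttt{case}\ v : t\ \texttt{do}\ e : t'}$$

$$\Gamma \vdash \texttt{case}\ v : \emptyset\ \texttt{do}\ e : \emptyset$$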
We write $\Gamma \vdash t <: t'\ \texttt{else}\ e : t''$ if in environment $\Gamma$ when $t <: t'$ does not 
hold then the body e of the else clause has type t''. Note that the type of the 
body is irrelevant if $t <: t'$. 



$$\frac{t <: t'}{\Gamma \vdash t <: t'\ \texttt{else}\ e : \emptyset}$$

$$\frac{\Gamma \vdash e : t''}{\Gamma \vdash t <: t'\ \texttt{else}\ e : t''}$$

Given the above, it is straightforward to construct the typing rule for a match 
expression. Recall that we write $t \wedge t'$ for the intersection of two types. 

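$$\frac{\begin{array}{l}
\Gamma \vdash e_0 : t_0 \\
\Gamma \vdash \texttt{case}\ v_1 : t_0 \wedge t_1\ \texttt{do}\ e_1 : t_1' \\
\qquad \cdots \\
\Gamma \vdash \texttt{case}\ v_n : t_0 \wedge t_n\ \texttt{do}\ e_n : t_n' \\
\Gamma \vdash t_0 <: t_1 \mid \cdots \mid t_n\ \texttt{else}\ e_{n+1} : t_{n+1}'
\end{array}}{\Gamma \vdash \left(\begin{array}{l}
\texttt{match}\ e_0 \\
\quad \texttt{case}\ v_1 : t_1\ \texttt{do}\ e_1 \\
\quad \cdots \\
\quad \texttt{case}\ v_n : t_n\ \texttt{do}\ e_n \\
\quad \texttt{else}\ e_{n+1}
\end{array}\right) : t_1' \mid \cdots \mid t_{n+1}'}$$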



7.5 Top-Level Expressions 

We write $\Gamma \vdash q$ if in environment $\Gamma$ the query item q is well-typed. 



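$$\Gamma \vdash \texttt{type}\ x = t$$

$$\frac{\Gamma, v_1 : t_1, \ldots, v_n : t_n \vdash e : t' \qquad t' <: t}{\Gamma \vdash \texttt{fun}\ f(v_1 : t_1; \ldots; v_n : t_n) : t = e}$$

$$\frac{\Gamma \vdash e : t' \qquad t' <: t}{\Gamma \vdash \texttt{let}\ v : t = e}$$

$$\frac{\Gamma \vdash e : t}{\Gamma \vdash \texttt{query}\ e}$$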

We extract the relevant component of a type environment from a query item 
q with the function environment(q). 

environment(type x = t) = ()

environment(fun f(v1 : t1; ...; vn : tn) : t) = f : (t1; ...; tn) → t

environment(let v : t = e) = v : t
We write $\vdash q_1 \ldots q_n$ if the sequence of query items $q_1 \ldots q_n$ is well typed. 

$$\frac{\Gamma = \mathit{environment}(q_1), \ldots, \mathit{environment}(q_n) \qquad \Gamma \vdash q_1 \quad \cdots \quad \Gamma \vdash q_n}{\vdash q_1 \ldots q_n}$$



8 Discussion 

The algebra has several important characteristics: its operators are orthogonal, 
strongly typed, and they obey laws of equivalence and optimization. 

There are many issues to resolve in the completion of the algebra. We enu- 
merate some of these here. 

Data Model. Currently, all forests in the data model are ordered. It may be 
useful to have unordered forests. The distinct operator, for example, produ- 
ces an inherently unordered forest. Unordered forests can benefit from many 
optimizations for the relational algebra, such as commutable joins. 

The data model and algebra do not define a global order on documents. 
Querying global order is often required in document-oriented queries. 

Currently, the algebra does not support reference values, which are defined 
in the XML Query Data Model. The algebra’s type system should be extended 
to support reference types and the data model operators ref and deref should 
be supported. 

Type System. As discussed, the algebra’s internal type system is closely related to 
the type system of XDuce. A potentially significant problem is that the algebra’s 
types may lose information when converted into XML Schema types, for example, 
when a result is serialized into an XML document and XML Schema. 

The type system is currently first order: it does not support function types 
nor higher-order functions. Higher-order functions are useful for specifying, for 
example, sorting and grouping operators, which take other functions as argu- 
ments. 

The type system is currently monomorphic: it does not permit the definition 
of a function over generalized types. Polymorphic functions are useful for 
factoring equivalent functions, each of which operates on a fixed type. The lack of 
polymorphism is one of the principal weaknesses of the type system. 

Operators. We intentionally did not define equality or relational operators on 
element and scalar types. These operators should be defined by consensus. 

It may be useful to add a fixed-point operator, which can be used in lieu of 
recursive functions to compute, for example, the transitive closure of a collection. 

Functions. There is no explicit support for externally defined functions. 

The set of builtin functions may be extended to support other important 
operators. 
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Recursion. Currently, the algebra does not guarantee termination of recursive 
expressions. In order to ensure termination, we might require that a recursive 
function take one argument that is a singleton element, and any recursive invo- 
cation should be on a descendant of that element; since any element has a finite 
number of descendants, this avoids infinite regress. (Ideally, we should have a 
simple syntactic rule that enforces this restriction, but we have not yet devised 
such a rule.) 
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Abstract. Rewriting queries using views is a powerful technique that 
has applications in query optimization, data integration, data warehou- 
sing etc. Query rewriting in relational databases is by now rather well 
investigated. However, in the framework of semistructured data the pro- 
blem of rewriting has received much less attention. In this paper we focus 
on extracting as much information as possible from algebraic rewritings 
for the purpose of optimizing regular path queries. The cases when we 
can find a complete exact rewriting of a query using a set of views are 
very “ideal.” However, there is always information available in the views, 
even if this information is only partial. We introduce “lower” and “pos- 
sibility” partial rewritings and provide algorithms for computing them. 
These rewritings are algebraic in their nature, i.e. we use only the al- 
gebraic view definitions for computing the rewritings. This fact makes 
them a main memory product which can be used for reducing secondary 
memory and remote access. We give two algorithms for utilizing the 
partial lower and partial possibility rewritings in the context of query 
optimization. 



1 Introduction 

Semistructured data is a self-describing collection, whose structure can naturally 
model irregularities that cannot be captured by relational or object-oriented data 
models [ABS99]. This kind of data is usually best formalized in terms of labelled 
graphs, where the graphs represent data found in many useful applications such 
as web information systems, XML data repositories, digital libraries, communication 
networks, and so on. Almost all the query languages for semi-structured 
data provide the possibility for the user to query the database through regular 
expressions. The design of query languages using regular path expressions is ba- 
sed on the observation that many of the recursive queries that arise in practice 
amount to graph traversals. These queries are in essence graph patterns and the 
answers to the query are subgraphs of the database that match the given pattern 
[MW95,FLS98,CGLV99,CGLV2000]. 

For example, to answer a query containing the regular expression 
(_* · article) · (_* · ref · _* · (ullman + widom)) one should find all the paths having 
at some point an edge labelled article, followed by any number of other edges, 
then by an edge ref, and finally by an edge labelled with ullman or widom. 
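
To make the example concrete, the regular path expression can be tested against sequences of edge labels. This is only an illustrative sketch: it assumes that labels are whitespace-free strings, encodes a path as its labels each followed by a space, and uses Python's `re` module; the names `ANY`, `label`, `PATTERN` and `matches` are hypothetical.

```python
import re

ANY = r"(?:\S+ )"   # `_`: any single edge label, followed by the separator

def label(name):
    """Regex fragment for one specific edge label."""
    return "(?:" + re.escape(name) + " )"

# (_* . article) . (_* . ref . _* . (ullman + widom))
PATTERN = re.compile(
    ANY + "*" + label("article")
    + ANY + "*" + label("ref")
    + ANY + "*" + "(?:" + label("ullman") + "|" + label("widom") + ")"
)

def matches(path):
    """True if the sequence of edge labels spells a word of the query."""
    return PATTERN.fullmatch("".join(l + " " for l in path)) is not None
```

For instance, `matches(["db", "article", "section", "ref", "ullman"])` holds, while a path that continues past `ullman` or that never traverses a `ref` edge is rejected.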
Based on practical observations, the most expensive part of answering queries 
on semistructured data is finding these graph patterns described by regular 
expressions. This is because a regular expression can describe arbitrarily long 
paths in the database, which in turn means an arbitrary number of physical 
accesses. Hence it is clear that having a good optimizer for answering regular path 
(sub)queries is very important. This optimizer can be used for the broader class 
of full-fledged query languages for semistructured data. 

In semistructured data, as well as in other data models such as relational and 
object oriented, the importance of utilizing views is well recognized [LMSS95], 
[CGLV99], [Lev99]. Simply stated, the problem is: Given a query Q and a set of 
views {V1, ..., Vn}, find a representation of Q by means of the views and then 
answer the query on the basis of this representation. Several papers [LMSS95, 
Ull97,GM99,PV99] investigate this problem for the case of conjunctive queries. 
The methods in these papers are based on the query containment and the fact 
that the number of literals in the minimal rewriting is bounded from above by 
the number of literals in the query. 

It is obvious that a method for rewriting regular path queries requires a 
technique for rewriting regular expressions, i.e. given a regular expression E 
and a set of regular expressions E1, E2, ..., En one wants to compute a function 
f(E1, E2, ..., En) which approximates E. As far as the authors know, there are 
two methods for computing such a function f which approximates E from below. 
The first one, by Conway [Con71], is based on the derivatives of regular expressions, 
which provide the ground for the development of an algebraic theory of 
factorization in the regular algebra that in turn gives the tools for computing 
the approximating function. The second method, by Calvanese et al [CGLV99], 
is automata based. Both methods are equivalent in the sense that they compute 
the same rewriting of a query. However, these methods model, using views, only 
full paths of the database, i.e. paths whose labels spell a word belonging to the 
regular language of the query. But in practice, the cases in which we can infer 
from the views full paths for the query are very “ideal.” The views can cover 
partial paths which can be satisfactorily long for using them in optimization, but 
if they are not complete paths, they are ignored by the above mentioned methods. 
So, it would probably be better to give a partial rewriting in order to 
capture all the information provided by the views. The information provided 
by the views is always useful, even if it is partial and not complete. The problem 
of a partial rewriting is touched upon briefly in [CGLV99]. However, there this 
problem is considered only as an extension of the complete rewriting, enriching 
the set of the views with new elementary one-symbol views, and materializing 
them before query evaluation. The choice of the new elementary views to be 
materialized is done in a brute force way, using some cost criteria depending on 
the application. 

In this paper we use a very different approach. For each word in the regular 
language of the query we do the best possible using views. If the word contains 
a sub-path that a view has traversed before, we use that view for evaluation. We 
present generalized query answering algorithms that access the database only 
when necessary. For the “been there” subpaths our algorithms use the views. 
Note that we do not materialize any new views, we only consult the database 
“on the fly,” as needed. 

The outline of the paper is as follows. In Section 2 we formalize the pro- 
blem of query rewriting using views in the realistic framework of cached views 
and available database. Then we discuss the utility of algebraic rewritings. We 
illustrate through an example that the complete rewritings can be empty for 
a particular query, while the partial information provided by the views is no 
less than 99% of the complete “missing” information. In Section 3 we introduce 
and formally define a new algebraic, formal-language operator, the exhaustive 
replacement. Simply described, given two languages L1 and L2, the result of the 
exhaustive replacement of L2 in L1 is the replacement, by a special symbol, 
of all the words of L2 that occur as sub-words in the words of L1. Then we 
give a theorem showing that the result of the exhaustive replacement can be 
represented as an intersection of a rational transduction and a regular language. 
The proof of the theorem is constructive and provides an algorithm for compu- 
ting the exhaustive replacement operator. In Section 4 we present the partial 
possibility rewriting that is a generalization of the previously introduced exhau- 
stive replacement operator. In Section 5 we define a partial lower rewriting. It 
is the largest subset of words in the partial possibility rewriting such that their 
expansions to the database alphabet are contained in the query language. In 
Section 6 we review a typical query answering algorithm for regular path queries 
and show how to modify it into two other “lazy” algorithms for utilizing the 
partial lower and possibility rewritings respectively. The computational comple- 
xity is studied in Section 7. We show that, although exponential, the algorithms 
proposed for computing the partial possibility and partial lower rewritings are 
essentially optimal. 

2 Background 

Rewriting regular queries. Let $\Delta$ be a finite alphabet, called the database 
alphabet. Elements of $\Delta$ will be denoted $R, S, T, R', S', \ldots, R_1, S_1, \ldots$, etc. Let 
$\mathbf{V} = \{V_1, \ldots, V_n\}$ be a set of view definitions, with each $V_i$ being a finite or 
infinite regular language over $\Delta$. Associated with each view definition $V_i$ is a 
view name $v_i$. We call the set $\Omega = \{v_1, \ldots, v_n\}$ the outer alphabet, or view 
alphabet. For each $v_i \in \Omega$, we set $\mathit{def}(v_i) = V_i$. The substitution def associates 
with each view name $v_i$ in the $\Omega$ alphabet the language $V_i$. The substitution def is 
applied to words, languages, and regular expressions in the usual way (see e.g. 
[HU79]). 

A (user) query Q is a finite or infinite regular language over $\Delta$. Sometimes 
we need to refer to regular expressions representing the languages Q and $V_i$. We 
then write re(Q) and re($V_i$) respectively to denote these expressions. 

A maximal lower rewriting (l-rewriting) of a user query Q using $\mathbf{V}$ is a 
language $Q'$ over $\Omega$ that includes all the words $v_{i_1} \ldots v_{i_k} \in \Omega^*$ such that 

$$\mathit{def}(v_{i_1} \ldots v_{i_k}) \subseteq Q.$$
A maximal possibility rewriting (p-rewriting) of a user query Q using $\mathbf{V}$ is a 
language $Q''$ over $\Omega$ that includes all the words $v_{i_1} \ldots v_{i_k} \in \Omega^*$ such that 

$$\mathit{def}(v_{i_1} \ldots v_{i_k}) \cap Q \neq \emptyset.$$

For instance, if re(Q) is $(RS)^*$, and we have the views $V_1$, $V_2$, $V_3$ and $V_4$ 
available, with $re(V_1) = R + SS$, $re(V_2) = S$, $re(V_3) = SR$ and $re(V_4) = (RS)^+$ 
respectively, the l-rewriting is $v_4^*$ and the p-rewriting is $(v_4 + v_1 v_3^* v_2)^*$. 
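
The two definitions can be explored by brute force on this example. The sketch below is an approximation made for illustration only: it assumes database words are truncated at a hypothetical bound `MAX_LEN` and view words at length 3, and it tests membership with Python regular expressions rather than with the automata constructions used in the literature.

```python
import re
from itertools import product

# View definitions and query from the example, written as Python regexes.
VIEWS = {"v1": "(?:R|SS)", "v2": "S", "v3": "SR", "v4": "(?:RS)+"}
QUERY = re.compile(r"(?:RS)*")
MAX_LEN = 8   # bound on database words; an assumption of this sketch

# All database words over {R, S} up to the bound.
DB_WORDS = ["".join(p) for n in range(MAX_LEN + 1) for p in product("RS", repeat=n)]

def expansion(view_word):
    """def(view_word), restricted to database words of length <= MAX_LEN."""
    pattern = re.compile("".join(VIEWS[v] for v in view_word))
    return {w for w in DB_WORDS if pattern.fullmatch(w)}

def in_lower(view_word):
    """Every (bounded) expansion lies in Q."""
    return all(QUERY.fullmatch(w) for w in expansion(view_word))

def in_possibility(view_word):
    """Some (bounded) expansion lies in Q."""
    return any(QUERY.fullmatch(w) for w in expansion(view_word))

view_words = [p for n in range(1, 4) for p in product(VIEWS, repeat=n)]
lower = {w for w in view_words if in_lower(w)}
possibility = {w for w in view_words if in_possibility(w)}
```

Within these bounds, `lower` contains only repetitions of `v4`, while `possibility` also contains words such as `("v1", "v2")` and `("v1", "v3", "v2")`, whose expansions merely intersect the query language.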

Semistructured databases. We consider a database to be an edge labeled 
graph. This graph model is typical in semistructured data, where the nodes of 
the database graph represent the objects and the edges represent the attributes 
of the objects, or relationships between the objects. 




Fig. 1. An example of a graph database 

Formally, we assume that we have a universe of objects D. Objects will be 
denoted $a, b, c, a', b', \ldots, a_1, b_2, \ldots$, and so on. A database DB over $(D, \Delta)$ is a 
pair $(N, E)$, where $N \subseteq D$ is a set of nodes and $E \subseteq N \times \Delta \times N$ is a set of 
directed edges labeled with symbols from $\Delta$. Figure 1 contains an example of a 
graph database. 

If there is a path labeled $R_1, R_2, \ldots, R_k$ from a node a to a node b we write 
$a \xrightarrow{R_1 R_2 \ldots R_k} b$. Let Q be a query and DB = (N, E) a database. Then the answer 
to Q on DB is defined as 

$$\mathit{ans}(Q, DB) = \{(a, b) \in N^2 : a \xrightarrow{W} b \text{ for some } W \in Q\}.$$

For instance, if DB is the graph in Figure 1, and Q = {SR, T}, then ans(Q, DB) 
= {(b, d), (d, b), (c, a)}. 
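
The definition of ans(Q, DB) can be evaluated by searching the product of the database graph with an automaton for Q. The following is a minimal sketch under stated assumptions: the edge set `EDGES` is a hypothetical database (it is not the graph of Figure 1), the query Q = {SR, T} is given as a hand-written deterministic automaton, and the function name `answer` is illustrative.

```python
from collections import deque

# Hypothetical database: edges (source, label, target).
EDGES = {("a", "R", "b"), ("b", "S", "c"), ("c", "R", "d"), ("c", "T", "a")}

# DFA for the finite query Q = {SR, T}: state -> {label: next state}.
DFA = {0: {"S": 1, "T": 2}, 1: {"R": 2}, 2: {}}
START, FINAL = 0, {2}

def answer(edges, dfa, start, final):
    """ans(Q, DB): pairs (a, b) joined by a path whose labels spell a word of Q."""
    nodes = {x for (x, _, _) in edges} | {y for (_, _, y) in edges}
    result = set()
    for a in nodes:
        # Breadth-first search over (database node, automaton state) pairs.
        seen = {(a, start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in final:
                result.add((a, node))
            for (x, lab, y) in edges:
                nxt = dfa[state].get(lab)
                if x == node and nxt is not None and (y, nxt) not in seen:
                    seen.add((y, nxt))
                    queue.append((y, nxt))
    return result

pairs = answer(EDGES, DFA, START, FINAL)
```

Every step of the search follows one database edge, which is why long or unbounded query words translate into many physical accesses.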

What are rewritings good for? In a scenario with a database and materialized 
views there are various assumptions, such as the exactness/soundness/ 
completeness of the views, and whether the database relations are available, 
and if so, at what cost compared to the cost of accessing the views (see papers 
[AD98,GM99,Lev99]). Depending on the application (information integration, 
cache-based optimization, etc.) different assumptions are valid. The use of 
rewritings in answering user queries using views has been thoroughly investigated 
in the case of relational databases (see e.g. the survey [Lev99]). For the 
case of semi-structured databases much less is currently known. Notably, Calvanese 
et al [CGLV99] show how to obtain l-rewritings, and the same authors, 
in [CGLV2000], discuss the possible use of l-rewritings in information integration 
applications. The present authors show in [GT2000] how p-rewritings are obtained 
and how they are profitable in information integration applications, where 
the database graph is unavailable. The paper [GT2000] shows that running an 
l-rewriting on the view extensions is guaranteed to produce a subset of the desired 
answer, while running the p-rewriting is guaranteed to produce a superset. 

In particular, the l-rewriting can be empty, even if the desired answer is not. 
Suppose for example that the query Q is $re(Q) = R_1 \ldots R_{100}$ and we have available 
two views $V_1$ and $V_2$, where $re(V_1) = R_1 \ldots R_{49}$ and $re(V_2) = R_{51} \ldots R_{100}$. It is 
easy to see that the l-rewriting is empty. However, depending on the application, 
a “partial rewriting” such as $v_1 R_{50} v_2$ could be useful. In the next section we 
develop a formal algebraic framework for the partial rewritings. This framework 
is flexible enough and can be easily tailored to the specific needs of the various 
applications. In Section 6 we demonstrate the usability of the partial rewritings 
in query optimization. 



3 Replacement — A New Algebraic Operator 

In this section we introduce and study a new algebraic operation, the exhaustive 
replacement in words and languages. It is similar in spirit to the deletion and 
insertion language operations studied in [Kari91]. 

Let W be a word, and M an $\epsilon$-free language over some alphabet, and let $\dagger$ be 
a symbol outside that alphabet. Then we define 

$$\rho_M(W) = \begin{cases} \{W_1 \dagger W_3 : \exists\, W_2 \in M \text{ such that } W = W_1 W_2 W_3\} & \text{if non-empty} \\ \{W\} & \text{otherwise.} \end{cases}$$

Furthermore, let L be a set of words over the same alphabet as M. Then define 
$\rho_M(L) = \bigcup_{W \in L} \rho_M(W)$. We can now define the powers of $\rho_M$ as follows: 

$$\rho^1_M(\{W\}) = \rho_M(W), \qquad \rho^{i+1}_M(\{W\}) = \rho_M(\rho^i_M(\{W\})).$$

Let k be the smallest integer such that $\rho^{k+1}_M(\{W\}) = \rho^k_M(\{W\})$. We then set 

$$\rho^*_M(W) = \rho^k_M(\{W\}).$$

(It is clear that k is at most the number of symbols in W.) 

The exhaustive replacement of an $\epsilon$-free language M in a language L, using a 
special symbol $\dagger$ not in the alphabet, can be simply defined as 

$$L \triangleright M = \bigcup_{W \in L} \rho^*_M(W).$$
Intuitively, the exhaustive replacement $L \triangleright M$ replaces in every word $W \in L$ 
the non-overlapping occurrences of words from M with the special symbol $\dagger$. 
Moreover, between two occurrences of words of M that have been replaced, no 
nonempty word from M remains as a subword. 

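Example 1. Let L = {RSRSRSR, RRSRSR, RSRRSRRSR}, M = {RSR}. 
Then 

$$L \triangleright M = \{\dagger S \dagger,\; RS{\dagger}SR,\; R{\dagger}SR,\; RRS\dagger,\; \dagger\dagger\dagger\},$$

being the union of the sets: 

$$\rho^*_{\{RSR\}}(RSRSRSR) = \{\dagger S \dagger,\; RS{\dagger}SR\},$$

$$\rho^*_{\{RSR\}}(RRSRSR) = \{R{\dagger}SR,\; RRS\dagger\},$$

$$\rho^*_{\{RSR\}}(RSRRSRRSR) = \{\dagger\dagger\dagger\}.$$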
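
For finite languages the operator can be computed directly from its definition. The sketch below follows the fixed-point construction above; it assumes that words are Python strings over single-character symbols, that M is finite and contains no empty word, and that the character `†` does not occur in the alphabet. The function names are illustrative.

```python
DAGGER = "†"

def rho(word, m):
    """One replacement step: every way of replacing one occurrence of a word of m."""
    out = {word[:i] + DAGGER + word[i + len(u):]
           for u in m for i in range(len(word)) if word.startswith(u, i)}
    return out or {word}          # {W} when no occurrence exists

def rho_star(word, m):
    """Iterate rho on the set {word} until it stops changing."""
    current = {word}
    while True:
        nxt = set().union(*(rho(w, m) for w in current))
        if nxt == current:
            return current
        current = nxt

def exhaustive_replacement(lang, m):
    """L |> M for finite languages L and M."""
    return set().union(*(rho_star(w, m) for w in lang))

example = exhaustive_replacement({"RSRSRSR", "RRSRSR", "RSRRSRRSR"}, {"RSR"})
```

Applied to the languages of Example 1, `example` is the five-word set given there.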



Computing the Replacement Operation. To this end, we will first give a 
characterization of the $\triangleright$ operator. The construction in the proof of our 
characterization provides the basic algorithm for computing the result of the $\triangleright$ operator 
on given languages. The construction is based on finite transducers. 

A finite transducer $T = (S, I, O, \delta, s, F)$ consists of a finite set of states S, 
an input alphabet I, an output alphabet O, a starting state s, a set of final 
states F, and a transition-output function $\delta$ from finite subsets of $S \times I^*$ to finite 
subsets of $S \times O^*$. Intuitively, for instance $(q_1, W) \in \delta(q_0, U)$ means that if the 
transducer is in state $q_0$ and reads word U, it can go to state $q_1$ and emit the 
word W. For a given word $U \in I^*$, we say that a word $W \in O^*$ is an output 
of T for U if there exists a sequence $(q_1, W_1) \in \delta(s, U_1)$, $(q_2, W_2) \in \delta(q_1, U_2)$, 
..., $(q_n, W_n) \in \delta(q_{n-1}, U_n)$ of state transitions of T, such that $q_n \in F$, 
$U = U_1 \ldots U_n$, and $W = W_1 \ldots W_n$. We write $W \in T(U)$, where T(U) denotes the 
set of all outputs of T for the input word U. For a language $L \subseteq I^*$, we define 
$T(L) = \bigcup_{U \in L} T(U)$. It is well known that T(L) is regular whenever L is. 

We are now in a position to state our characterization theorem. 

Theorem 1. Let L and M be regular languages over an alphabet $\Delta$. There exists 
a finite transducer T and a regular language M' such that: 

$$L \triangleright M = T(L) \cap M'.$$



Proof sketch. Let $A = (S, \Delta, \delta, s_0, F)$ be a nondeterministic finite automaton 
that accepts the language M. Let us consider the finite transducer: 

$$T = (S \cup \{s_0'\}, \Delta, \Gamma, \delta', s_0', \{s_0'\}),$$

where $\Gamma = \Delta \cup \{\dagger\}$, and, written as a relation, 




Algebraic Rewritings for Optimizing Regular Path Queries 307 



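$$\begin{aligned}
\delta' = {} & \{(s, R, s', \epsilon) : (s, R, s') \in \delta\} \;\cup \\
& \{(s_0', R, s_0', R) : R \in \Delta\} \;\cup \\
& \{(s_0', R, s, \epsilon) : (s_0, R, s) \in \delta\} \;\cup \\
& \{(s_0', R, s_0', \dagger) : (s_0, R, s) \in \delta \text{ and } s \in F\} \;\cup \\
& \{(s, R, s_0', \dagger) : (s, R, s') \in \delta \text{ and } s' \in F\}.
\end{aligned}$$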

Intuitively, transitions in the first set of $\delta'$ are the transitions of the “old” 
automaton modified so as to produce $\epsilon$ as output. Transitions in the second set mean 
that “if we like, we can leave everything unchanged,” i.e. each symbol gives itself 
as output. Transitions in the third set are for jumping non-deterministically from 
the new initial state $s_0'$ to the states of the old automaton A that are reachable 
in one step from the old initial state $s_0$. These transitions give $\epsilon$ as output. 
Transitions in the fourth set are for handling special cases, when from the old initial 
state $s_0$ an old final state can be reached in one step. In these cases we can 
replace the one-symbol words accepted by A with the special symbol $\dagger$. Finally, 
the transitions of the fifth set are the most significant. Their meaning is: in a 
state where the old automaton has a transition by a symbol, say R, to an old 
final state, there will be in the transducer an additional transition $R/\dagger$ to $s_0'$, 
which is also the (only) final state of T. Observe that if the transducer T decides 
to leave the state $s_0'$ while a suffix U of the input string is unscanned, and enter 
the old automaton A, then it can return back only if there is a prefix U' of U 
such that $U' \in L(A)$. In this case the transducer replaces U', which is a subword of 
the input string, by the special symbol $\dagger$. 

Given a word W as input, the finite transducer T replaces arbitrarily 
many occurrences of words of M in W with the special symbol $\dagger$. 

For an example, suppose M is $R(SR)^* + RST$. Then an automaton that 
accepts this language is given in Figure 2, drawn with solid arrows. The 
corresponding finite transducer is shown in the same figure on the right. It consists of 
the automaton A, whose transitions now produce $\epsilon$ as output, plus the state $s_0'$ 
and the additional transitions drawn with dashed arrows. 

It can now be shown that 

$$T(L) = L \cup \{U_1 \dagger U_2 \dagger \cdots \dagger U_k : \text{for some } U \in L \text{ and words } W_i \in M,\; U = U_1 W_1 U_2 W_2 \ldots W_{k-1} U_k\}.$$

From the transduction T(L) we get all the words of L having replaced in them 
an arbitrary number of words from M. What we want is not an arbitrary but an 
exhaustive replacement of words from M. To achieve this goal we will intersect 
the language T(L) with a regular language M' which will serve as a “mask” for 
the words of $L \triangleright M$. We set 

$$M' = (\Gamma^* M \Gamma^*)^c.$$

Now M' guarantees that no other candidate for replacing occurs inside the words 
of the final result. □ 
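
The statement of Theorem 1 can be checked on finite languages without building the automaton. The sketch below is an assumption-laden stand-in for the construction in the proof: `transduce` enumerates the outputs of the replacement transducer by recursion on the input word instead of simulating states, `passes_mask` tests membership in M', and M is assumed finite and free of the empty word.

```python
DAGGER = "†"

def transduce(word, m):
    """All outputs of the replacement transducer on `word`: any number of
    non-overlapping occurrences of words of m are replaced by the dagger."""
    if not word:
        return {""}
    # Either copy the first symbol unchanged ...
    outputs = {word[0] + rest for rest in transduce(word[1:], m)}
    # ... or replace a prefix that belongs to m by the special symbol.
    for u in m:
        if word.startswith(u):
            outputs |= {DAGGER + rest for rest in transduce(word[len(u):], m)}
    return outputs

def passes_mask(word, m):
    """Membership in M', the complement of Gamma* M Gamma*: no subword in m."""
    return not any(u in word for u in m)

def replace_via_transducer(lang, m):
    """T(L) intersected with M', for finite languages."""
    return {w for word in lang for w in transduce(word, m) if passes_mask(w, m)}

result = replace_via_transducer({"RSRSRSR", "RRSRSR", "RSRRSRRSR"}, {"RSR"})
```

On the languages of Example 1 the transduction alone also yields words such as `†SRSR`, which still contain an occurrence of RSR; the mask removes exactly those.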

Fig. 2. An example of the construction of a replacement transducer 



4 Partial P-Rewritings 

We can give a natural generalization of the definition of the replacement operator 
for the case when we like to exhaustively replace subwords not from one language 
only, but from a finite set of languages (such as a finite set of view definitions). For 
this purpose, let W be a word and $\mathbf{M} = \{M_1, \ldots, M_n\}$ be a set of languages over 
some alphabet, and let $\{\dagger_1, \ldots, \dagger_n\}$ be a set of symbols outside that alphabet. 
Now we define 

$$\rho_{\mathbf{M}}(W) = \begin{cases} \{W_1 \dagger_i W_3 : \exists\, W_2 \in M_i \text{ such that } W = W_1 W_2 W_3\} & \text{if non-empty} \\ \{W\} & \text{otherwise.} \end{cases}$$

Then $\rho^*_{\mathbf{M}}$ is defined similarly to $\rho^*_M$. 

The generalized exhaustive replacement of $\mathbf{M} = \{M_1, \ldots, M_n\}$ in a language 
L, by the corresponding special symbols $\dagger_1, \ldots, \dagger_n$, is 

$$L \triangleright \mathbf{M} = \bigcup_{W \in L} \rho^*_{\mathbf{M}}(W).$$

In the following we will define the notion of the partial p-rewriting of a 
database query Q using a set $\mathbf{V} = \{V_1, \ldots, V_n\}$ of view definitions. 

Definition 1 The partial p-rewriting of a query Q over $\Delta$ using a set $\mathbf{V} = 
\{V_1, \ldots, V_n\}$ of view definitions over $\Delta$ is 

$$Q \triangleright \mathbf{V},$$

with $\Omega = \{v_1, \ldots, v_n\}$ as the corresponding set of special symbols. 

As a generalization of Theorem 1 we can give the following result about the 
partial p-rewriting of a query Q over $\Delta$ using a set $\mathbf{V} = \{V_1, \ldots, V_n\}$ of view 
definitions over $\Delta$. 



Theorem 2. The partial p-rewriting $Q \triangleright \mathbf{V}$ can be effectively computed. 
Proof sketch. Let $A_i = (S_i, \Delta, \delta_i, s_{0i}, F_i)$, for $i \in [1, n]$, be n nondeterministic 
finite automata that accept the corresponding $V_i$ languages. Let us consider the 
finite transducer: 

$$T = (S_1 \cup \ldots \cup S_n \cup \{s_0'\}, \Delta, \Delta \cup \Omega, \delta', s_0', \{s_0'\}),$$

where 

$$\begin{aligned}
\delta' = {} & \{(s, R, s', \epsilon) : (s, R, s') \in \delta_i,\ i \in [1, n]\} \;\cup \\
& \{(s_0', R, s_0', R) : R \in \Delta\} \;\cup \\
& \{(s_0', R, s, \epsilon) : (s_{0i}, R, s) \in \delta_i,\ i \in [1, n]\} \;\cup \\
& \{(s_0', R, s_0', v_i) : (s_{0i}, R, s) \in \delta_i \text{ and } s \in F_i,\ i \in [1, n]\} \;\cup \\
& \{(s, R, s_0', v_i) : (s, R, s') \in \delta_i \text{ and } s' \in F_i,\ i \in [1, n]\}.
\end{aligned}$$

The transducer T performs the following task: given a word of Q as input, it 
replaces nondeterministically some words of $V_1 \cup \ldots \cup V_n$ from the input with 
the corresponding special symbols. The proof of this claim is similar to that of the 
previous theorem. 

From the transduction T(Q) we get all the words of Q having replaced in 
them an arbitrary number of words from $V_1 \cup \ldots \cup V_n$. But what we want is the 
exhaustive replacement $Q \triangleright \mathbf{V}$. For this we intersect the language T(Q) with the 
regular language 

$$((\Delta \cup \Omega)^* (V_1 \cup \ldots \cup V_n)(\Delta \cup \Omega)^*)^c,$$

which will serve as a mask for extracting the words in the exhaustive replacement. 

□ 
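
For finite queries and finite view languages the partial p-rewriting can again be computed straight from the definition. The sketch below revisits the example of Section 2 with views covering $R_1 \ldots R_{49}$ and $R_{51} \ldots R_{100}$. It assumes that words are tuples of edge labels, that view names do not clash with database symbols, and that no view contains the empty word; the function names are hypothetical and the fixed-point iteration stands in for the transducer construction of the proof.

```python
def rho(word, views):
    """One step: replace one occurrence of a word of some view by that view's name."""
    out = set()
    for name, language in views.items():
        for u in language:
            for i in range(len(word) - len(u) + 1):
                if word[i:i + len(u)] == u:
                    out.add(word[:i] + (name,) + word[i + len(u):])
    return out or {word}

def partial_p_rewriting(query, views):
    """Q |> V for finite languages whose words are tuples of edge labels."""
    result = set()
    for word in query:
        current = {word}
        while True:
            nxt = set().union(*(rho(w, views) for w in current))
            if nxt == current:
                break
            current = nxt
        result |= current
    return result

labels = tuple(f"R{i}" for i in range(1, 101))          # R1 ... R100
views = {"v1": {labels[:49]}, "v2": {labels[50:]}}      # R1...R49 and R51...R100
rewriting = partial_p_rewriting({labels}, views)
```

Here `rewriting` consists of the single mixed word `("v1", "R50", "v2")`: only the edge labelled `R50` still has to be looked up in the database.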

We note here that the partial p-rewriting of a query is a generalization of the 
p-rewriting. Indeed, consider the substitution on $\Omega \cup \Delta$ that maps each 
$v_i \in \Omega$ to the corresponding regular view language $V_i$ and each database symbol 
$R \in \Delta$ to itself. This substitution is the extension of the def substitution to 
the $\Delta$ alphabet and we call it def'. Then the partial p-rewriting is the set of 
all the words W on $\Omega \cup \Delta$, with no subwords in any of $V_1, \ldots, V_n$, such that 
def'(W) has a non-empty intersection with Q. The conceptual similarity of the 
partial p-rewriting with the p-rewriting can also be observed in another way: change 
the above mask to $\Omega^*$ and the result will be the p-rewriting, as opposed to the 
partial p-rewriting. 

5 Partial L-Rewritings 

We defined the l-rewriting of a query Q given a set of view definitions $\mathbf{V} = 
\{V_1, \ldots, V_n\}$ as the set of all the words W on the view alphabet $\Omega$ such that 
def(W) is contained in the query language Q. In the same spirit we will define 
the partial l-rewriting. It will be the set of all “mixed” words W on the alphabet 
$\Omega \cup \Delta$, with no subword in $V_1 \cup \cdots \cup V_n$, such that their substitution by the 
extended def' is contained in the query Q. The condition that there is no subword 
in $V_1 \cup \cdots \cup V_n$ says that in fact the partial l-rewriting is a subset of the partial 
p-rewriting. 



Definition 2 The partial l-rewriting of a query Q on $\Delta$ is the language Q' on
$\Omega \cup \Delta$ given by

$$Q' = \{W \in (Q \triangleright V) : \mathit{def}'(W) \subseteq Q\}.$$
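On small instances the definition can be cross-checked by brute force. The following is only a sketch under simplifying assumptions that are not part of the paper: every symbol is a single character, database symbols are lowercase, view symbols are uppercase, each view language is a finite non-empty set of words, the query is given as a Python regular expression, and only mixed words up to a length bound are enumerated. The names `expand`, `has_view_subword` and `partial_l_rewriting` are hypothetical.

```python
import re
from itertools import product


def expand(word, views):
    """All database words in def'(word): every view symbol is replaced by
    each word of its (finite) view language, database symbols map to themselves."""
    choices = [sorted(views[sym]) if sym in views else [sym] for sym in word]
    return {"".join(parts) for parts in product(*choices)}


def has_view_subword(word, views):
    """True if some word of some view language occurs as a subword of `word`."""
    all_view_words = set().union(*views.values())
    return any(v in word for v in all_view_words)


def partial_l_rewriting(query_regex, views, db_symbols, max_len):
    """Mixed words W (up to max_len) that pass the mask and satisfy def'(W) <= Q."""
    query = re.compile(query_regex)
    alphabet = sorted(db_symbols) + sorted(views)
    result = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            if has_view_subword(w, views):
                continue  # W still contains a replaceable subword
            if all(query.fullmatch(u) for u in expand(w, views)):
                result.add(w)
    return result


# Query Q = a(bc)*d, one view X with language {bc}.
print(partial_l_rewriting(r"a(bc)*d", {"X": {"bc"}}, "abcd", 4))
```

Because the view languages are assumed non-empty, every word kept by the two tests also has a non-empty intersection with Q, so it lies in the partial p-rewriting as the definition requires.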

We now give a method for computing the partial l-rewriting, given a query
Q and a set $V = \{V_1, \dots, V_n\}$ of view definitions as input.

Algorithm 1

1. Compute the complement $Q^c$ of the query.

2. Construct the transducer T used for the partial p-rewriting. Then compute
the transduction $T(Q^c)$.

3. Compute the complement $(T(Q^c))^c$ of the previous transduction.

4. Intersect the complement $(T(Q^c))^c$ with the mask

$$M = ((\Delta \cup \Omega)^* (V_1 \cup \dots \cup V_n)(\Delta \cup \Omega)^*)^c.$$

Denote with Q' the result. □



Theorem 3. The mixed $\Omega \cup \Delta$ language Q' gives exactly the partial l-rewriting
of Q.

Proof. "$\subseteq$". $T(Q^c)$ is the set of all words W on $\Omega \cup \Delta$ such that
$\mathit{def}'(W) \cap Q^c \neq \emptyset$. Hence $(T(Q^c))^c$, being the complement of this set, will
contain only $\Omega \cup \Delta$ words such that all the $\Delta$-words in their substitution by def'
are contained in Q. This is the first condition for a word on $\Omega \cup \Delta$ to be in the
partial l-rewriting of Q. Furthermore, by intersecting with the mask M we keep in
$(T(Q^c))^c$ only the $\Omega \cup \Delta$ words that do not contain $\Delta$-subwords in
$V_1 \cup \dots \cup V_n$. This is the second condition for a word on $\Omega \cup \Delta$ to be in the
partial l-rewriting of Q.

"$\supseteq$". We prove this direction by contradiction. First observe that both
the partial l-rewriting and the set Q' are subsets of the partial p-rewriting, that
is, all their words "pass" the mask M. In other words, their words do not have
subwords in $V_1 \cup \dots \cup V_n$. Suppose now that the mixed $\Omega \cup \Delta$-word W is in
the partial l-rewriting but not in Q'. That is, $\mathit{def}'(W) \subseteq Q$. On the other hand,
since $W \notin Q'$ it follows that $W \in Q'^c$, which means that $W \in T(Q^c) \cup M^c$. But
as we mentioned before, the word W, which belongs to the partial l-rewriting,
"passes" the mask M and this implies that it cannot "pass" the complement of
the mask. Therefore, $W \in T(Q^c)$. Thus $\mathit{def}'(W) \cap Q^c \neq \emptyset$, that is,
$\mathit{def}'(W) \not\subseteq Q$, i.e. W cannot be in the partial l-rewriting, a contradiction. □




Algebraic Rewritings for Optimizing Regular Path Queries 311 



6 Query Optimization Using Partial Rewritings and 
Views 

In this section we show how to utilize partial rewritings in query optimization
in a scenario where we have available a set of precomputed views, as well as
the database itself. The views could be materialized views in a warehouse, or
locally cached results from previous queries in a client/server environment. In
this scenario the views are assumed to be exact, and we are interested in
answering the query by consulting the views as far as possible, and by accessing
the database only when necessary.

Formally, let $\Omega = \{v_1, \dots, v_n\}$ be the view alphabet and let $V = \{V_1, \dots, V_n\}$
be a set of view definitions as before. Given a database DB, which is a graph
where the edges are labelled with database symbols from $\Delta$, we define the view
graph $\mathcal{V}$ over $(V, \Omega)$ to be a database over $(D, \Omega)$ induced by the set

$$\bigcup_{i \in \{1, \dots, n\}} \{(a, v_i, b) : (a, b) \in ans(V_i, DB)\}$$

of $\Omega$-labelled edges.

It is now straightforward to show that if the l-rewriting Q' is exact (meaning
def(Q') = Q), then $ans(Q, DB) = ans(Q', \mathcal{V})$ (see Calvanese et al. [CGLV2000]).

However, the cases in which we are able to obtain an exact rewriting of the query
using the views will be rare in practice; in general, the views contain
only part of the information needed to answer the query. So, should we ignore
this partial information only because it is not complete? In the previous sections
we showed how this partial information can be captured algebraically by the
partial rewritings. In the following, we use the partial rewritings not to avoid
accessing the database completely, but to minimize such access as much as possible.

However, in order to be able to utilize the partial l-rewriting Q', it should be
exact, i.e. we require that def'(Q') = Q. For testing exactness we can use
the optimal algorithm of [CGLV99].

Given an exact partial l-rewriting, we can use it to evaluate the query on the
view graph, accessing the database in a "lazy" fashion, only when necessary.
Before describing the lazy algorithm, let us review how query answering on
semistructured databases typically works [ABS99].

Algorithm 2 We are given a regular expression for Q and a database graph
DB. First construct an automaton $A_Q$ for Q. Let N be the set of nodes in the
database graph, and $s_0$ be the initial state in $A_Q$. For each node $a \in N$ compute
a set $Reach_a$ as follows.

1. Initialize $Reach_a$ to $\{(a, s_0)\}$.

2. Repeat 3 until $Reach_a$ no longer changes.

3. Choose a pair $(x, s) \in Reach_a$. If there is a database symbol R, such that
a transition $s \xrightarrow{R} s'$ is in $A_Q$ and an edge $x \xrightarrow{R} x'$ is in the database DB,
then add the pair $(x', s')$ to $Reach_a$.

Finally, set

$$ans(Q, DB) = \{(a, b) : a \in N,\ (b, s) \in Reach_a, \text{ and } s \text{ is a final state in } A_Q\}.$$

□
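The fixpoint computation above can be sketched as a breadth-first search over pairs (node, state). This is a minimal illustration, not the authors' implementation; the representation is an assumption: the automaton is a dictionary from (state, label) to a set of states, and the database is a collection of (node, label, node) triples. The function name `answer_query` is hypothetical.

```python
from collections import deque


def answer_query(delta, start, finals, db_edges):
    """Evaluate a regular path query given as an NFA over an edge-labelled graph.

    delta    : dict mapping (state, label) -> set of successor states
    start    : initial state of the automaton
    finals   : set of final states
    db_edges : iterable of (node, label, node) triples
    """
    outgoing = {}
    nodes = set()
    for x, label, y in db_edges:
        outgoing.setdefault(x, []).append((label, y))
        nodes.update((x, y))

    answers = set()
    for a in nodes:
        reach = {(a, start)}                 # step 1
        queue = deque(reach)
        while queue:                         # steps 2 and 3
            x, s = queue.popleft()
            for label, x2 in outgoing.get(x, []):
                for s2 in delta.get((s, label), ()):
                    if (x2, s2) not in reach:
                        reach.add((x2, s2))
                        queue.append((x2, s2))
        answers |= {(a, b) for (b, s) in reach if s in finals}
    return answers


# Query R S* on a small chain-shaped database.
delta = {(0, "R"): {1}, (1, "S"): {1}}
db = [(1, "R", 2), (2, "S", 3), (3, "S", 4), (4, "R", 5)]
print(sorted(answer_query(delta, 0, {1}, db)))
```

Using a work queue instead of "repeat until no change" computes the same set $Reach_a$, since each pair is processed once after it is first added.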

In the following we modify this algorithm into a lazy algorithm for answering
a query Q using its partial l-rewriting with respect to a set of cached exact views.

Algorithm 3 We are given an automaton $A_{Q'}$, corresponding to an exact
partial l-rewriting Q', and the view graph $\mathcal{V}$. Let N be the set of nodes in
$\mathcal{V}$, and $s_0$ be the initial state in $A_{Q'}$. For each node $a \in N$ compute a set $Reach_a$.

1. Initialize $Reach_a$ to $\{(a, s_0)\}$, and $Expanded_a$ to false.

2. For each database symbol R, if there is in $A_{Q'}$ a transition $s_0 \xrightarrow{R} s$ from
the initial state $s_0$, then access the database and add to $\mathcal{V}$ the subgraph of
DB induced by the R-edges.

3. Repeat 4 until $Reach_a$ no longer changes.

4. Choose a pair $(x, s) \in Reach_a$. If there is a view or database symbol R, such
that a transition $s \xrightarrow{R} s'$ is in $A_{Q'}$, go to 5.

5. If there is an edge $x \xrightarrow{R} x'$ in the view graph, add the pair $(x', s')$ to $Reach_a$.
Otherwise, if $Expanded_x$ = false, set $Expanded_x$ = true, access the database
and add to $\mathcal{V}$ the subgraph of DB induced by all edges originating
from x.

Set

$$eval(Q', \mathcal{V}, DB) = \{(a, b) : a \in N,\ (b, s) \in Reach_a, \text{ and } s \text{ is a final state in } A_{Q'}\}.$$

□
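A possible reading of the lazy algorithm is sketched below. It is an illustration under stated assumptions rather than the authors' implementation: the automaton and the graphs use the same hypothetical representation as plain Python dictionaries and sets of (node, label, node) triples, the expansion flag is kept per visited node, the set of start nodes is taken after step 2, and an access log is returned only to make the database accesses visible. The name `lazy_eval` is hypothetical.

```python
def lazy_eval(delta, start, finals, view_edges, db_edges, db_symbols):
    """Run an automaton for Q' over the view graph, fetching database
    edges only on demand. Returns (answers, access_log)."""
    graph = set(view_edges)   # the view graph, extended as we go
    access_log = []           # which parts of DB were fetched (illustration only)

    # Step 2: database symbols on transitions leaving the initial state.
    for (state, label) in list(delta):
        if state == start and label in db_symbols:
            graph |= {e for e in db_edges if e[1] == label}
            access_log.append(("symbol", label))

    expanded = set()          # nodes whose outgoing DB edges were already fetched
    answers = set()
    nodes = {n for (x, _, y) in graph for n in (x, y)}
    for a in nodes:
        reach = {(a, start)}                              # step 1
        changed = True
        while changed:                                    # step 3
            changed = False
            for (x, s) in list(reach):                    # step 4
                for (state, label), targets in delta.items():
                    if state != s:
                        continue
                    succ = {y for (x1, l, y) in graph if x1 == x and l == label}
                    if not succ and x not in expanded:    # step 5, second case
                        expanded.add(x)
                        graph |= {e for e in db_edges if e[0] == x}
                        access_log.append(("node", x))
                        succ = {y for (x1, l, y) in graph if x1 == x and l == label}
                    new_pairs = {(y, s2) for y in succ for s2 in targets} - reach
                    if new_pairs:
                        reach |= new_pairs
                        changed = True
        answers |= {(a, b) for (b, s) in reach if s in finals}
    return answers, access_log


# Q = R S T, view V1 = R S, exact partial l-rewriting Q' = v1 T.
delta = {(0, "v1"): {1}, (1, "T"): {2}}
db = {(1, "R", 2), (2, "S", 3), (3, "T", 4), (5, "T", 6)}
views = {(1, "v1", 3)}
print(lazy_eval(delta, 0, {2}, views, db, {"R", "S", "T"}))
```

In this example only the edges leaving node 3 are fetched from the database; the edge between nodes 5 and 6 is never accessed.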

Theorem 4. Given a query Q and a set of exact views, if the partial l-rewriting
Q' of Q is exact, then $eval(Q', \mathcal{V}, DB) = ans(Q, DB)$.

Next, let us discuss how to utilize the partial p-rewriting Q'' of a query Q
for computing the answer set ans(Q, DB). If we use the same algorithm as in
the case of the partial l-rewriting we might get a proper superset of the answer.
Note, however, that in contrast to the requirement of Algorithm 3, the partial
p-rewriting does not need to be exact.

Theorem 5. Given a query Q and a set V of exact views, if Q'' is the partial
p-rewriting of Q using V, then $ans(Q, DB) \subseteq eval(Q'', \mathcal{V}, DB)$.

In other words, we are not sure if all the pairs are valid. To be able to discard
false hits, suppose that the views are materialized using Algorithm 2. We can
then associate each pair (a, b) in the view graph with its derivation. That
is, for each pair (a, b) connected with an edge, say $v_i$, in the view graph, we
associate an automaton, say $A_{ab}$, with start state a and final states {b}. What
is this automaton? For each pair (a, b), we can consider the database graph as
a non-deterministic automaton $DB_{ab}$ with initial state a and final states {b}. It
is now easy to see that

$$A_{ab} = DB_{ab} \cap A_{V_i},$$

where $A_{V_i}$ is an automaton for the view $V_i$. We are now ready to formulate the
algorithm for using the partial p-rewriting in query answering.







Algorithm 4

1. Compute $eval(Q'', \mathcal{V}, DB)$ using Algorithm 3. During the execution of
Algorithm 3 the view graph $\mathcal{V}$ is extended with new edges and nodes as described.
Call the extended view graph $\mathcal{V}'$.

2. Replace in $\mathcal{V}'$ each edge labeled with a view symbol, say $v_i$, between two
objects a and b with the automaton $A_{ab}$ of the derivation. Call the new
graph $\mathcal{V}''$.

3. Set $verified(Q'', \mathcal{V}, DB) = eval(Q'', \mathcal{V}, DB) \cap \{(a, b) : ans(Q, \mathcal{V}''_{ab}) \neq \emptyset\}$,
where $\mathcal{V}''_{ab}$ is a non-deterministic automaton similar to $DB_{ab}$.

Theorem 6. Given a query Q and a set V of exact views, if Q'' is the partial
p-rewriting of Q using V, then $verified(Q'', \mathcal{V}, DB) = ans(Q, DB)$.

7 Complexity Analysis 

The following theorem establishes an upper bound for the problem of generating
the exhaustive replacement $L \triangleright M$, where L and M are regular languages.

Theorem 7. Generating the exhaustive replacement of a regular language M
from another language L can be done in exponential time.

Proof. Let us refer to the cost of the steps in the constructive proof of
Theorem 1. Constructing a non-deterministic automaton for the language M and using
it to construct the transducer g is polynomial. Computing the transduction g(L) of
the regular language L is again polynomial. At the end, however, computing
the subset of the words in g(L) to which no more replacements can be
applied is exponential. This is because we intersect with a mask that is a
language described by an extended regular expression containing complementation. □

Theorem 8. Let $\Gamma$ be an alphabet and A, B be regular languages over $\Gamma$. Then
the problem of deciding the emptiness of $A \cap (\Gamma^* B \Gamma^*)^c$ is PSPACE complete.

We are now in a position to prove the following result.

Theorem 9. There exist regular languages L and M, such that the exhaustive
replacement $L \triangleright M$ cannot be computed in polynomial time, unless PTIME =
PSPACE.

Proof. Suppose that, given two regular expressions A and B on alphabet $\Gamma$, we
want to test the emptiness of $A \cap (\Gamma^* B \Gamma^*)^c$. Without loss of generality let us
assume that there exists one symbol in A that does not appear in B. To see
why even with this restriction the above problem of emptiness is still PSPACE
complete, imagine that we can simply have a tape symbol which does not appear
at all in the definition of the transition function of the Turing machine. Then
this symbol will appear in the above set A but not in B (see Appendix). Let
us denote this special symbol with f. We substitute the f symbol in A with the
regular expression B. The result will be another regular expression A' which has
polynomial size. Clearly, $A \cap (\Gamma^* B \Gamma^*)^c = A \cap (A' \triangleright B)$.
In conclusion, if we had a polynomial time algorithm producing a polynomial
size representation for $A' \triangleright B$, we could polynomially construct an NFA for
$A \cap (A' \triangleright B)$. Then we could check the emptiness of this NFA in NLOGSPACE.
This means that the emptiness of $A \cap (\Gamma^* B \Gamma^*)^c$ could be checked in PTIME,
which is a contradiction, unless PTIME = PSPACE. □

Corollary 1 The algorithm in the proof of Theorem 2 for computing the partial
p-rewriting of a query Q using a set $V = \{V_1, \dots, V_n\}$ of view definitions is
essentially optimal.

Theorem 10. Given a query Q and a set $V = \{V_1, \dots, V_n\}$ of view definitions,
the partial l-rewriting can be computed in 2EXPTIME.

Proof. Let us refer to the constructive proof of Theorem 3. Computing the
complement $Q^c$ of the query is exponential. Transducing it to $T(Q^c)$ is polynomial.
Complementing again is exponential. So, in total we have 2EXPTIME.
Computing the mask is EXPTIME and intersecting is polynomial. Finally,
2EXPTIME + EXPTIME = 2EXPTIME. □

For the partial lower rewriting we have the following.

Theorem 11. Algorithm 1 for computing the partial l-rewriting of a query Q
using a set $V = \{V_1, \dots, V_n\}$ of view definitions is essentially optimal.

Proof. Polynomially intersect the partial l-rewriting with $\Omega^*$ and get the
l-rewriting of [CGLV99]. But the l-rewriting is optimally computed in doubly
exponential time in [CGLV99], so our algorithm is essentially optimal. □

Abstract. With the proliferation of database views and curated databases,
the issue of data provenance - where a piece of data came from
and the process by which it arrived in the database - is becoming increasingly
important, especially in scientific databases where understanding
provenance is crucial to the accuracy and currency of data. In this paper
we describe an approach to computing provenance when the data
of interest has been created by a database query. We adopt a syntactic
approach and present results for a general data model that applies to
relational databases as well as to hierarchical data such as XML. A novel
aspect of our work is a distinction between "why" provenance (refers to
the source data that had some influence on the existence of the data)
and "where" provenance (refers to the location(s) in the source databases
from which the data was extracted).



1 Introduction 

Data provenance, sometimes called "lineage" or "pedigree", is the description
of the origins of a piece of data and the process by which it arrived in a
database. The field of molecular biology, for example, supports some 500 public
databases [1], but only a handful of these are "source" data in the sense that
they receive experimental data. All the other databases are in some sense views
either of the source data or of other views. In fact, some of them are views of
each other, which sounds nonsensical until one understands that the individual
databases are not simply computed by queries, but also have added value in the
form of corrections and annotations by experts (they are "curated"). A serious
problem confronting the user of one of these databases is knowing the provenance
of a given piece of data. This information is essential to anyone interested in the
accuracy and timeliness of the data.

Understanding provenance and the process by which one records it is a complex
issue. In this paper we address an important part of the general problem.
Suppose a database (a view) V = Q(D) is constructed by a query Q applied to

* This work was partly supported by a Digital Libraries 2 grant DL-2 IIS 98-17444 
** Supported in part by an Alfred P. Sloan Research Fellowship. 



J. Van den Bussche and V. Vianu (Eds.): ICDT 2001, LNCS 1973, pp. 316—330, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 




Why and Where: A Characterization of Data Provenance 






databases D and we ask for the provenance of some piece of data d in Q(D):
what parts of the database D contributed to d? The problem has been addressed
by [7,2] for relational databases. In particular, [7] considers the question: given
a tuple in Q(D), what tuples in D contributed to it? The crucial question here is
what is meant by "contributed to". By examining provenance in a more general
setting we draw a distinction between "where-provenance" (where does a given
piece of data come from) and "why-provenance" (why is it in the database).
Consider the following example:

SELECT name, telephone
FROM employee
WHERE salary > (SELECT AVG(salary) FROM employee)

If one sees the tuple ("John Doe", 1234) in the output one could argue that
every tuple in employee contributed to it, for modifying any tuple in the employee
relation could affect the presence of ("John Doe", 1234) in the result. This is the
why-provenance and it is what is studied in [7] as the set of contributing tuples. On
the other hand, suppose one asks where the telephone number 1234 in the tuple
("John Doe", 1234) comes from. The answer is apparently much simpler: from
the telephone field of the "John Doe" tuple in the input. This statement presupposes
that name is a key for the employee relation; if it is not, we need some other means
of identifying the tuple in the source, for SQL does not eliminate duplicates. (Had
we used SELECT UNIQUE the answer would be a set of locations.) The point
is that where-provenance requires us to identify locations in the source data.
Where-provenance is important for understanding the source of errors in data
(what source data should John Doe investigate if he discovers that his telephone
number is incorrect in the view?). It is also important for carrying annotations
through database queries. Therefore, as a basis for describing where-provenance,
we use the data model proposed in [6] in which there is an explicit notion of
location. The model has the advantage that it allows us to study provenance in
a more general context than the relational model. Existing work on provenance
considers only the relational model.

Outline. In the next section we describe the deterministic model in [6]. We then
give a syntactic characterization of why-provenance and show that it is invariant
under query rewriting. To this end, in Section 3 we describe a natural normal
form for queries and give a strong normalization result for query rewriting. The
normal form is useful because it also gives us a reasonable basis for defining
where-provenance, which turns out to be problematic and cannot, in general, be
expected to be invariant under query rewriting. We discuss a possible restriction
for which where-provenance has a satisfactory characterization.

Related work. Why-provenance has been studied for relations in [2,7]. To our
knowledge no one has studied where-provenance. A definition of why-provenance
for relational views is given in [7], which also shows how to compute
why-provenance for queries in the relational algebra. There, a semantic
characterization of provenance is given which, when restricted to SPJU, has the expected
properties such as invariance under query rewriting. In fact, the syntactic
techniques developed in this paper, when restricted to a natural interpretation of the
relational model, yield identical results to those in [7]. We do not know whether
there is a semantic characterization for where-provenance, nor do we know whether
there is a semantic characterization of why-provenance that is well behaved on
anything beyond SPJU queries.

Expressing the why-provenance for a query is loosely related to the view
maintenance problem [17]. It is apparently simpler in that (a) in why-provenance
we are not interested in what is not in the view (view maintenance needs to
account for additions to the database) and (b) we are not asking how to
reconstruct a view under a change to the source. Conversely, there is a loose
connection between where-provenance and the view update problem. (If I want
to update a data element in the output, what elements in the input need to be
changed?) Recently, [10] has proposed using the deterministic model described
here for view maintenance in scientific databases.



2 A Deterministic Model 

We describe the data model in [6] where the location of any piece of data can
be uniquely described by a path. This model uses a variation of existing
edge-labeled tree models for semistructured data [14,13]. It is more restrictive in that
the out-edges of each node have distinct labels; it is less restrictive because these
labels may themselves be pieces of semistructured data¹. Figure 1 shows how
certain common data structures can be expressed in this "deterministic" model
of semistructured data. Here, any node in the deterministic tree is uniquely
determined by a path of edge labels from the root node to that node. These paths
are analogous to l-values in programming language terminology. We will describe
shortly how relations can be cast in this model by using the keys as edge labels.
Any object-oriented or semistructured database with persistent object identifiers
for all structures can also be expressed. There is also a variety of hierarchical data
formats that implicitly conform to this model. Notably ACeDB [9], a lightweight
DBMS originally developed as a database for genetic data, conforms rather closely
to this model and also supports certain operations such as "deep union" which
are essential to the techniques developed in this paper.



2.1 Syntax and Operations 

Values. We use the notation x:y to denote a pair whose label is x and value is
y. We can think of x as the edge label and y as the subtree under it. We use
the notation $\{x_1:y_1, \dots, x_n:y_n\}$ to denote a set of such pairs. Since the edge labels
$x_1, \dots, x_n$ are distinct, this notation describes a finite partial function from values
to values. A set of values $\{s_1, \dots, s_n\}$ can always be described in our model by
mapping each element in the set to some standard constant (c in Figure 1). The
last example shows how edge labels can themselves be pieces of semistructured
data. Value equality can be computed inductively.

¹ For the purposes of normal forms, these pieces of semistructured data are required
to be "linear".

[Figure 1: tree drawings of four example values, with their textual syntax.]

A record:    a tree with edges Name and Height leading to "Bruce" and 6.2
An array:    {1:"a", 2:"b", 3:"c"}
A set:       {1:c, 2:c, 4:c}
A relation:  { {Id:1} : {Name:"Kim", Rate:50},
               {Id:2} : {Name:"Bob", Rate:75} }

Fig. 1. Examples of data structures represented in our syntax.



Paths. We use the notation $x_1.x_2.\,\dots\,.x_n$ for paths. In the last example of
Figure 1, the path {Id:1} identifies the value {Name:"Kim", Rate:50} and the
path {Id:1}.Rate identifies the value 50.

Abbreviation. We use $e_1.e_2.\,\dots\,.e_{n-1}:e_n$ as a shorthand for
$\{e_1:\{e_2:\{\dots\{e_{n-1}:e_n\}\dots\}\}\}$. We can think of $e_1.e_2.\,\dots\,.e_{n-1}$ as the path
leading to the value $e_n$. For example, {Id:1}.Name:Kim is an abbreviation for
{{Id:1}:{Name:Kim}}.

Traversal. We use v(p) to denote the subtree identified by a path p in value v.
If path p does not occur in v, then v(p) is undefined. For example, {a:1, b:2}(c)
is undefined while {{c:3}:1, b:2}({c:3}) is 1.
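Traversal can be illustrated with nested Python dictionaries. This is a sketch under assumptions that are not part of the paper: values are modelled as dictionaries, a path is a list of labels, "undefined" is signalled by returning None, and a structured label such as {c:3} is stood in for by the hashable tuple (("c", 3),). The name `traverse` is hypothetical.

```python
def traverse(value, path):
    """Return v(p), the subtree reached by following `path` in `value`,
    or None when the path does not occur in the value."""
    for label in path:
        if not isinstance(value, dict) or label not in value:
            return None  # v(p) is undefined
        value = value[label]
    return value


c3 = (("c", 3),)                              # stand-in for the label {c:3}
print(traverse({"a": 1, "b": 2}, ["c"]))      # path does not occur
print(traverse({c3: 1, "b": 2}, [c3]))        # follows the structured label
```

Returning None cannot distinguish an undefined traversal from a stored None constant; an exception would be a reasonable alternative design.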

Path representation. Observe that any value in our model can be described
by specifying the set of all paths to the constants at the terminal nodes. We
call this the path representation of v. For example, the path representation of
{a:{1:c, 3:d}} is {(a.1, c), (a.3, d)}.

Definition 1. (Substructure) w is a substructure of v, denoted as $w \sqsubseteq v$, if
the path representation of w is a subset of the path representation of v. □

Example. $a:\{1:c, 3:d\} \sqsubseteq a:\{1:c, 2:b, 3:d\}$ but $a:\{1:c, 3:d\} \not\sqsubseteq b.a:\{1:c, 3:d\}$. It
is easy to see that since our model is deterministic, if $w \sqsubseteq v$ then w occurs as a
part of v in a unique way.
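The path representation and the substructure test can be sketched as follows, assuming (this is not from the paper) that values are nested Python dictionaries with hashable labels and that constants are any non-dictionary values. The names `path_representation` and `is_substructure` are hypothetical.

```python
def path_representation(value, prefix=()):
    """Set of (path, constant) pairs, one per terminal node of the value."""
    if not isinstance(value, dict):
        return {(prefix, value)}
    pairs = set()
    for label, subtree in value.items():
        pairs |= path_representation(subtree, prefix + (label,))
    return pairs


def is_substructure(w, v):
    """w is a substructure of v if its path representation is a subset of v's."""
    return path_representation(w) <= path_representation(v)


v = {"a": {1: "c", 3: "d"}}
print(path_representation(v))
print(is_substructure(v, {"a": {1: "c", 2: "b", 3: "d"}}))
print(is_substructure(v, {"b": {"a": {1: "c", 3: "d"}}}))
```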

Definition 2. (Deep Union) The deep union of $v_1$ with $v_2$, written as $v_1 \cup v_2$,
is the value whose path representation is the union of the path representations
of $v_1$ and $v_2$. Note that the result may not be a partial function, in which case
the deep union is undefined. □

Example. The deep union of {a:1, b.c:2} and {b.d:4, e:5} is {a:1, b:{c:2,
d:4}, e:5} while the deep union of {a:1, b.c:2} and {b.c:3, e:5} is undefined.
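Deep union can be sketched as a recursive merge of nested dictionaries. The representation is an assumption made for illustration: values are Python dictionaries, constants are non-dictionary values, and an undefined result is reported by raising ValueError. The name `deep_union` is hypothetical.

```python
def deep_union(v1, v2):
    """Merge two values path by path; raise ValueError when the union of the
    path representations is not a partial function (deep union undefined)."""
    if isinstance(v1, dict) and isinstance(v2, dict):
        merged = dict(v1)
        for label, subtree in v2.items():
            if label in merged:
                merged[label] = deep_union(merged[label], subtree)
            else:
                merged[label] = subtree
        return merged
    if v1 == v2:          # the same constant reached by the same path
        return v1
    raise ValueError("deep union undefined: conflicting values on one path")


print(deep_union({"a": 1, "b": {"c": 2}}, {"b": {"d": 4}, "e": 5}))
try:
    deep_union({"a": 1, "b": {"c": 2}}, {"b": {"c": 3}, "e": 5})
except ValueError as err:
    print(err)
```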



2.2 An Encoding of Relations 

(a)  where $p_1 \in e_1, \dots, p_n \in e_n$, condition  collect e
(b)  where $sp_1 \in D_1, \dots, sp_n \in D_n$, condition  collect se
(c)  where Composers.x.born:u $\in$ D, u < 1700  collect {year:u}:C

Fig. 2. (a) General form and (b) normal form of a query fragment. (c) An example.

We can encode relations as follows. Each relation name forms the label of an
outgoing edge from the root node, which is in turn mapped to the set of keys
from that relation. Each key of a relation is then mapped to the corresponding
tuple it identifies in the relation. If there is no key, the tuples are modeled as a
set, that is, the entire tuple becomes an edge label. As an example, suppose we
have two relations Composers and Works as shown below. The key for Composers
is name and Works has a compound key (name, opus). The figure below also
shows the encoding of the relations into our model. We see that the keys of a tuple
are placed on an edge in our model. If a tuple contains a compound key, we
could model the entire compound key as a "linear" piece of semistructured data
on the edge. That is, the keys are placed one after another on the same edge. It
does not matter in which order we serialize the keys so long as this is done in a
consistent manner.



Composers

name            born    period
"J.S. Bach"     1685    "baroque"
"G.F. Handel"   1685    "baroque"
"W.A. Mozart"   1756    "classical"

Works

name            opus        title
"J.S. Bach"     "BMV82"     "I have enough."
"J.S. Bach"     "BMV552"    NULL
"G.F. Handel"   "HMV19"     "Art thou troubled?"

{ Composers:
    {{name:"J.S. Bach"}: {born:1685, period:"baroque"},
     {name:"G.F. Handel"}: {born:1685, period:"baroque"},
     {name:"W.A. Mozart"}: {born:1756, period:"classical"}},
  Works:
    {{{name:"J.S. Bach"}.opus:"BMV82"}: {title:"I have enough."},
     {{name:"J.S. Bach"}.opus:"BMV552"}: {title:"-"},
     {{name:"G.F. Handel"}.opus:"HMV19"}: {title:"Art thou troubled?"}} }
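The encoding can be sketched in Python as follows. This is only an illustration under assumptions not made in the paper: rows are dictionaries, the key attributes are given in a fixed order, and a key label such as {name:"J.S. Bach"}.opus:"BMV82" is stood in for by a hashable tuple of (attribute, value) pairs. The name `encode_relation` is hypothetical.

```python
def encode_relation(rows, key):
    """Map each row to: key label -> remaining attributes of the row.

    rows : list of dicts (one per tuple)
    key  : sequence of key attribute names, serialized in this fixed order
    """
    encoded = {}
    for row in rows:
        label = tuple((attr, row[attr]) for attr in key)
        if label in encoded:
            raise ValueError("key attributes do not identify a unique tuple")
        encoded[label] = {a: v for a, v in row.items() if a not in key}
    return encoded


composers = [
    {"name": "J.S. Bach", "born": 1685, "period": "baroque"},
    {"name": "G.F. Handel", "born": 1685, "period": "baroque"},
    {"name": "W.A. Mozart", "born": 1756, "period": "classical"},
]
works = [
    {"name": "J.S. Bach", "opus": "BMV82", "title": "I have enough."},
    {"name": "J.S. Bach", "opus": "BMV552", "title": "-"},
    {"name": "G.F. Handel", "opus": "HMV19", "title": "Art thou troubled?"},
]
db = {
    "Composers": encode_relation(composers, ["name"]),
    "Works": encode_relation(works, ["name", "opus"]),
}
print(db["Composers"][(("name", "J.S. Bach"),)])
```

Fixing the order of the key attributes in `key` corresponds to serializing a compound key in a consistent manner.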



2.3 XML 

At first sight, XML does not conform to a deterministic model. Insofar as some 
formal model for XML has been developed in the Document Object Model 
(DOM) [15] it is that of a node-labeled graph in which child labels may be 
repeated. The fact that it is node labeled is a minor irritant. Uniqueness is more 
serious. However, in the absence of any system of keys (see [16]) we can still fall 
back on the property, specified by the DOM, that child nodes can be uniquely 
identified by their positions and attribute nodes by their names. We defer the 
details of the translation of XML and a query language such as XML-QL [8] into 
our deterministic model and query language to the full version of this paper. 



3 A Query Language 

Query languages for semistructured data [3] are based on a general syntactic
form shown in Figure 2(a). The $p_i$'s are patterns whose syntax follows the syntax
for data (as defined in the previous section) augmented with variables².
Expressions $e, e_1, \dots, e_n$ are essentially the same as patterns but may contain
"where ... collect ..." expressions (nested queries); condition is simply a boolean
predicate on the variables of the query.

² In semistructured query languages, patterns can also include regular expressions on
the edge labels. We will not deal with such patterns in this paper.

The interpretation of a "where ... collect ..." expression is as follows: consider
each assignment of the variables in the expression that makes each pattern $p_i$
a substructure of the corresponding expression $e_i$. For each such assignment,
evaluate the condition. If it is true, add the (instantiated) value of e to the
output. Finally "union" together the output values. This interpretation is quite
general, but to make it precise we must (a) define what we mean by "union" and
(b) say what values a variable can bind to (a constant, an arbitrary value, or
something in between). Languages (see [13,5]) vary in their choice of the "union"
operation. In our case, we use the deep union operation. Thus the output of the
query in Figure 2(c) is {year:1685}:C even though this value is emitted twice.
A consequence of this is that the result of a query may be undefined.
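The interpretation of the query in Figure 2(c) can be sketched as a loop over variable bindings whose outputs are combined by deep union. The representation is an assumption for illustration only: Composers is a nested Python dictionary keyed by composer name, the label {year:u} is stood in for by the tuple ("year", u), and the constant C by the string "C". The names `deep_union` and `early_years` are hypothetical.

```python
def deep_union(v1, v2):
    """Recursive merge; raises ValueError when the deep union is undefined."""
    if isinstance(v1, dict) and isinstance(v2, dict):
        merged = dict(v1)
        for label, subtree in v2.items():
            merged[label] = deep_union(merged[label], subtree) if label in merged else subtree
        return merged
    if v1 == v2:
        return v1
    raise ValueError("deep union undefined")


def early_years(db):
    """where Composers.x.born:u in D, u < 1700 collect {year:u}:C"""
    result = {}
    for x, record in db["Composers"].items():   # bind x to a composer label
        u = record.get("born")                  # bind u to the value under born
        if u is not None and u < 1700:          # evaluate the condition
            result = deep_union(result, {("year", u): "C"})
    return result


db = {"Composers": {
    "J.S. Bach": {"born": 1685, "period": "baroque"},
    "G.F. Handel": {"born": 1685, "period": "baroque"},
    "W.A. Mozart": {"born": 1756, "period": "classical"},
}}
print(early_years(db))
```

The value for 1685 is emitted for two bindings of x, yet it appears once in the result because deep union merges identical paths.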

We add the deep union operation to our language, and the general syntax
can be summarized by the following grammar:

$$e ::= \textbf{where}\ p \in e, \dots, p \in e,\ \mathit{condition}\ \textbf{collect}\ e \;\mid\; e \cup e \;\mid\; \{e : e\} \;\mid\; c \;\mid\; x$$

where c ranges over constants, x over variables, p over patterns and condition
over conditions. Note that $\{e_1 : e'_1, \dots, e_n : e'_n\}$ is in fact a shorthand for
$\{e_1 : e'_1\} \cup \dots \cup \{e_n : e'_n\}$. We refer to this query language, for want of a
better term, as DQL (Deterministic QL). The syntax of the query language is
quite general, but its interpretation is limited by the model. In order to set up
the machinery to analyze provenance, we will make some restrictions both on
the syntax and interpretation of queries for the soundness of our rewrite rules.
First, we impose some syntactic restrictions.

Definition 3. (Well-Formed Query) A query Q is said to be well-formed if 
(a) no pattern pi is a single variable, (b) each expression is either a (nested) 
query or an expression that does not involve a query, and (c) each comparison 
is between variables or between variables and constants only. □ 

Conditions (a) and (b) are required for the soundness of our rewrite rules. 
Condition (c) restricts our queries to the “conjunctive” fragment for which con- 
tainment of queries can be easily determined. In addition to well-formedness, we 
say a query is well-defined if it is not undefined on any input. For the rest of the 
paper, we consider only queries that are both well-formed and well-defined. The 
next restriction we place is on the interpretation of a query. For this, we need 
the notion of a singular expression, which consists of a single path terminated 
by a constant or variable. 

Definition 4. (Singular expression) An expression e is singular if $e \neq (e_1 \cup e_2)$
for any non-empty and distinct expressions $e_1$ and $e_2$. □

Our restriction on the interpretation is that variables may only bind to
singular values. At first sight, this seems very restrictive and the interpretation of
a query is unusual. Consider the query (a) in Figure 3. It binds singular
values to y, and the output is {{name:"J.S. Bach"}.born:1685, {name:"G.F.
Handel"}.born:1685}. This is probably not the expected output for someone






P. Buneman, S. Khanna, and W.-C. Tan 



(a)  where Composers.x : y $\in$ D,
           born.u $\in$ y,
           u < 1700
     collect x : y

(b)  where Composers.x : y $\in$ D,
           born.u $\in$ y,
           Composers.x : z $\in$ D,
           u < 1700
     collect x : z

Fig. 3. More example DQL queries. C denotes some constant



familiar with, say, XML-QL in which variables bind to complete subtrees.
However, there is an easy translation, illustrated in query (b), from the XML-QL
interpretation³ into DQL. Note that the deep union reconstructs the subtree.

Restrictive as it may seem, DQL can capture positive (SPJU) relational
queries and positive nested relational algebra ([12]). It is less expressive than
XML-QL in that (a) it cannot express path patterns involving a Kleene star (*), (b) it
works only on hierarchical structures, and (c) the forms of Skolem functions and
nested queries in XML-QL that can be simulated are limited. We omit the
details in this paper.

Definition 5. (Normal Form) A query Q is said to be in normal form if Q
has the form $Q_1 \cup \dots \cup Q_n$ where each $Q_i$ is as shown in Figure 2(b). Here $sp_i$ and
se are a singular pattern and a singular expression respectively, $D_i$ is a database
constant and condition is a boolean predicate on the variables of the query. □

Our main result in this section is that every well-formed query has an
equivalent normal form which can be determined from our rewrite system $\mathcal{R}$. We
omit the details of $\mathcal{R}$ and state the strong normalization result, which says that
starting from any well-formed query, any sequence of applications of rewrite rules
leads to a normal form in a finite number of steps.

Theorem 1. (Strong Normalization) The rewrite system $\mathcal{R}$ is strongly
normalizing.



4 Two Meanings of Provenance 



Equipped with a data model and query language, we are now in a position
to formulate two meanings of provenance and to compute the provenance of a
component d in a view V = Q(D) where Q is a query and D is the source
data. We will formulate the provenance of d as a query Q' that is completely
determined by Q, D and d.

where Composers.{name:x}.{born:u, period:v} $\in$ D,
      Works.{{name:x}.opus:w}:y $\in$ D
collect
      {name:x}.{born:u, {opus:w}:y}

{ {name:"J.S. Bach"}:
    {born:1685,
     {opus:"BMV82"}:{title:"I have enough."},
     {opus:"BMV552"}:{title:"-"}},
  {name:"G.F. Handel"}:
    {born:1685,
     {opus:"HMV19"}:{title:"Art thou troubled?"}} }



^ We assume that the XML-QL interpretation contains a Skolem function that groups
by composer names.

The above query, say Q1, expresses a join on components of the database
described in Section 2.2. Consider the value referenced by {name:"G.F. Handel"}.born.
This value was generated by Q1 as an instance of the “collect” expression
in which the variable x was bound to "G.F. Handel" and u to 1685. We now look
at the patterns in the “where” clause to find what (simultaneous) matches of
these patterns caused these bindings. In this case there is only one such match,
consisting of the patterns (after instantiating the variables):

Composers.{name:"G.F. Handel"}.{born:1685, period:"baroque"}
Works.{{name:"G.F. Handel"}.opus:"HMV19"}.title:"Art thou troubled?"

Moreover, if we apply Q1 to any database that contains these structures, we
will obtain an output that contains {name:"G.F. Handel"}.born:1685. This
is the rationale for calling these structures the why-provenance of the value
referenced by {name:"G.F. Handel"}.born. However, if we are interested in the
where-provenance of {name:"G.F. Handel"}.born, we only need to look at the
pattern(s) that bind the variable u to determine that it came from the path
Composers.{name:"G.F. Handel"}.born.

Our example suggests that one natural approach to compute provenance is 
via syntactic analysis of the query and this is the approach that we take. 
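
The distinction drawn in this section can be made concrete with a small sketch. The code below is not the paper's DQL machinery: it assumes, purely for illustration, that the Composers and Works components are stored as flat Python records, and the function name `provenance_of_born` is hypothetical. It returns the simultaneous pattern matches (why-provenance) and the path that bound the variable u (where-provenance) for the `born` value of a composer.

```python
# Illustrative data modelled on the running example; the flat record
# layout is an assumption made for this sketch, not the paper's data model.
COMPOSERS = [
    {"name": "J.S. Bach", "born": 1685, "period": "baroque"},
    {"name": "G.F. Handel", "born": 1685, "period": "baroque"},
    {"name": "W.A. Mozart", "born": 1756, "period": "classical"},
]
WORKS = [
    {"name": "J.S. Bach", "opus": "BMV82", "title": "I have enough."},
    {"name": "J.S. Bach", "opus": "BMV552", "title": "-"},
    {"name": "G.F. Handel", "opus": "HMV19", "title": "Art thou troubled?"},
]


def provenance_of_born(name):
    """Return (why, where) for the output value {name:...}.born."""
    why = []       # one entry per simultaneous match of both patterns
    where = set()  # paths that bound the output variable u
    for c in COMPOSERS:
        for w in WORKS:
            if c["name"] == name and w["name"] == name:
                why.append((c, w))
                where.add('Composers.{name:"%s"}.born' % name)
    return why, where


why, where = provenance_of_born("G.F. Handel")
```

For Handel, `why` holds the single composer/work pair and `where` the single path into Composers; the Works record is needed to justify the output but contributes nothing to where the value 1685 was copied from.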

5 Why-Provenance 

In the model-theoretic approach to datalog programs described in [4], these
programs are viewed as a set of first-order sentences describing the desired answer.
For example, if we have a datalog rule $R(u) \,\text{:-}\, R_1(u_1), \dots, R_n(u_n)$, we could
associate the logical sentence $\forall x_1, \dots, x_m\, \bigl(\bigwedge_{i \in [1,n]} R_i(u_i)\bigr) \rightarrow R(u)$, where
$x_1, \dots, x_m$ are the variables occurring in the rule. A DQL query
$\{e \mid p_1 \in D, \dots, p_n \in D, \mathit{condition}\}$ could be viewed as the following logical sentence:
$\forall x_1, \dots, x_m\, \bigl(\bigwedge_{i \in [1,n]} p_i \in D \wedge \mathit{condition}\bigr) \rightarrow e$ is in the output, where
$x_1, \dots, x_m$ are the variables that occur in the query. Therefore a value v is provable
if there exists a valuation that makes the premise true and puts v in the output.

As discussed earlier, the structures in the why-provenance example correspond
to a proof for {name:"G.F. Handel"}.born:1685. We call the collection
of values taken from D that proves an output a witness for the output. More
specifically, we say a value s is a witness for a value t with respect to a query Q
and a database D if $t \sqsubseteq Q(s)$ and $s \sqsubseteq D$. The value shown below is a witness
for {name:"G.F. Handel"}.born:1685.

{ Composers.{name:"G.F. Handel"}.{born:1685, period:"baroque"},
  Works.{{name:"G.F. Handel"}.opus:"HMV19"}.title:"Art thou troubled?" }



5.1 Witness Basis 

We now refine the notion of witness as introduced above to be explicitly tied to
the structure of a given query as well as an input database. Specifically, for a
singular value t, we only consider witnesses that correspond to the deep union
of values taken from D (at the leaves of a proof tree for t) with respect to
a query Q. For Q1 and output {name:"G.F. Handel"}.born:1685, the witness
above corresponds to values at the leaves of the proof tree taken from D. The
following is also a witness for the same value, but it is not the result of the deep
union of values at the leaves of any proof tree for that value.

{ Composers.{ {name:"G.F. Handel"}.{born:1685, period:"baroque"},
              {name:"W.A. Mozart"}.{born:1756, period:"classical"} },
  Works.{{name:"G.F. Handel"}.opus:"HMV19"}.title:"Art thou troubled?" }

We describe next our notion of a witness basis which captures the set of all 
witnesses of the former type for any value t in Q{D). Our definition closely 
follows the syntax of the query. 

Definition 6. (Witness Basis) Consider a normal form query Q. The witness
basis for a singular value t with respect to Q and D, denoted as $W_{Q,D}(t)$, is:

(1) If Q is of the form $Q_1 \cup \dots \cup Q_n$ then $W_{Q,D}(t) = W_{Q_1,D}(t) \cup \dots \cup W_{Q_n,D}(t)$.

(2) If Q is of the form $\{e \mid p_0 \in e_0, \dots, p_n \in e_n, \mathit{condition}\}$, let $\Psi$ be the set of all
valuations on the variables of Q such that the “where” clause of Q holds under
each valuation in $\Psi$. Then $W_{Q,D}(t) = \{ [\![p_0]\!]_\psi \sqcup \dots \sqcup [\![p_n]\!]_\psi \mid \psi \in \Psi,\ t = [\![e]\!]_\psi \}$.
Note that $e_i$ ($0 \le i \le n$) is a database constant since Q is in normal form.

(3) Otherwise, $W_{Q,D}(t) = \{\}$.

More generally, for any well-formed query Q, we can define the witness basis
by extending (2) as follows. We partition the set of $p_i \in e_i$ in the “where”
clause of Q into two parts: $S_1 = \{p_i \mid e_i \text{ is the database constant } D\}$ and
$S_2 = \{(p_j, e_j) \mid p_j \text{ is a pattern matched against a query } e_j\}$. We use $p^1_1, \dots, p^1_k$
to denote the members of $S_1$ and $(p^2_1, e^2_1), \dots, (p^2_m, e^2_m)$ to denote the members of
$S_2$. Let $\Psi$ be the set of all valuations on the variables of Q such that for each
valuation in $\Psi$, the “where” clause of Q holds. Then $W_{Q,D}(t) = \{P_1 \sqcup P_2 \mid \psi \in \Psi,\
P_1 = [\![p^1_1]\!]_\psi \sqcup \dots \sqcup [\![p^1_k]\!]_\psi,\ P_2 = w_1 \sqcup \dots \sqcup w_m \text{ where } w_i \in W_{\psi(e^2_i),D}([\![p^2_i]\!]_\psi)\}$.
For a compound value t, the witness basis is the product of the individual witness
bases of the singular values making up t. That is, consider $t = t_1 \sqcup \dots \sqcup t_m$ where
each $t_i$ is singular. Then $W_{Q,D}(t) = \{w_1 \sqcup \dots \sqcup w_m \mid w_i \in W_{Q,D}(t_i)\}$. □

The general definition above looks for patterns which are matched against 
the database constant D and patterns which match against queries. The former 
is collected together as part of the witness under P. If the generator is a nested 
query, we inductively look for the witness basis of these patterns under the 
valuation and later combine the results together by taking the product. Next, 
we show that the witness basis of a well-formed query is in fact the same as the 
witness basis of its normal form. 

Lemma 1. If Q is rewritten to Q' via the rewrite system R, then for any value t in the
output of Q(D), $W_{Q,D}(t) = W_{Q',D}(t)$.

Computing a Witness Basis. We next show a procedure for finding $W_{Q,D}(t)$
where t is a singular value and Q is a query in normal form. That is,
$Q = Q_1 \cup \dots \cup Q_n$ and each $Q_i = \{e_i \mid p_{i1} \in D, \dots, p_{ik_i} \in D, \mathit{condition}_i\}$. To look for
members of the witness basis of t, we need to search for valuations on variables in




Why and Where: A Characterization of Data Provenance 



325 



each Qi that will produce t. For those valuations that produce t, the deep union
of $p_{i1}$ to $p_{ik_i}$ under each valuation is returned as a result. However, instead of
searching the witness basis directly, we produce a query Qi' which, when evaluated,
will generate the witness basis. The “where” clause of Qi' is the same as that of Qi,
and the “collect” clause contains an output expression which is the deep union
of all patterns in the “where” clause of Qi placed on an edge. This is to prevent
intermixing with other members of the witness basis. The algorithm for generating
Qi' from Qi is described below. Why(t, Q, D) is simply Why(t, Q1, D) ∪ ... ∪
Why(t, Qn, D). ψ is a valuation from ei to t such that ψ(ei) = t. This technique
is sound and complete in the sense that the set of witnesses in $W_{Q,D}(t)$ is the
same as the set of witnesses returned by Why(t, Q, D)(D).

Algorithm: Why(t, Qi, D)

  Let A denote the “where” clause of Qi.
  Let A' denote the deep union of patterns in A.
  if there is a valuation ψ from ei to t then
    Return the query “where ψ(A) collect ψ(A'):C”
    (For simplicity, we did not serialize the output expression on the edge.)
  else
    No query is returned
  end if

Theorem 2. (Soundness and Completeness) Let Q be a query in normal
form and t be any singular value in the output of Q(D). Then
$W_{Q,D}(t) = \mathit{Why}(t, Q, D)(D)$.
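
As a rough illustration of how a witness basis can be enumerated, the sketch below searches for valuations directly rather than generating the query Qi' described above. It makes several simplifying assumptions that are not in the paper: the database is a set of flat tuples, variables are strings starting with `?`, conditions are omitted, and the deep union of instantiated patterns is approximated by a set of facts. All function names are hypothetical.

```python
def match(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    if len(pattern) != len(fact):
        return None
    new = dict(binding)
    for term, value in zip(pattern, fact):
        if isinstance(term, str) and term.startswith("?"):
            # a variable: bind it, or check it agrees with an earlier binding
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def valuations(patterns, db, binding=None):
    """Yield every valuation under which all patterns match facts of db."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for fact in db:
        new = match(patterns[0], fact, binding)
        if new is not None:
            yield from valuations(patterns[1:], db, new)


def instantiate(pattern, psi):
    """Replace the variables of pattern by their values under psi."""
    return tuple(psi.get(term, term) for term in pattern)


def witness_basis(patterns, output, db, t):
    """One witness (set of instantiated patterns) per valuation producing t."""
    return {
        frozenset(instantiate(p, psi) for p in patterns)
        for psi in valuations(patterns, db)
        if instantiate(output, psi) == t
    }


DB = {
    ("Composers", "J.S. Bach", 1685, "baroque"),
    ("Composers", "G.F. Handel", 1685, "baroque"),
    ("Composers", "W.A. Mozart", 1756, "classical"),
    ("Works", "J.S. Bach", "BMV82", "I have enough."),
    ("Works", "J.S. Bach", "BMV552", "-"),
    ("Works", "G.F. Handel", "HMV19", "Art thou troubled?"),
}
PATTERNS = [("Composers", "?x", "?u", "?v"), ("Works", "?x", "?w", "?y")]
OUTPUT = ("?x", "?u")

handel = witness_basis(PATTERNS, OUTPUT, DB, ("G.F. Handel", 1685))
```

Under these assumptions `handel` contains exactly one witness, made of the Handel record in Composers and the HMV19 record in Works, while the corresponding call for Bach yields two witnesses, one per work.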

A Comparison. We point out here that our notion of witness basis coincides 
with the derivation of a tuple in [7] for SPJU queries where the general case of 
theta-join is considered. The details are deferred to the full version. 

5.2 Minimal Witness Basis 

Observe that a witness for a value is invariant under all equivalent queries but 
the witness basis is not. We show next that a subset of the witness basis, called 
minimal witness basis, is in fact invariant under queries with only equalities. 

Definition 7. (Minimal Witness, Minimal Witness Basis) A value s is
a minimal witness for a value t with respect to Q if $\forall s' \sqsubset s,\ t \not\sqsubseteq Q(s')$. The
minimal witness basis for a value t with respect to a query Q and database D,
denoted as $M_{Q,D}(t)$, is a maximal subset of $W_{Q,D}(t)$ such that $\forall m \in M_{Q,D}(t)$,
$\nexists w \in W_{Q,D}(t)$ such that $w \sqsubset m$. □

Example: According to the query Q1, introduced earlier, the witness shown in
Section 5 is a minimal witness for {name:"G.F. Handel"}.born:1685 while the
witness shown in Section 5.1 is not.
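
The selection of the minimal witness basis from a witness basis can be sketched in a few lines. The sketch assumes, as a simplification not made in the paper, that each witness is a `frozenset` of facts, so that the strict containment $w \sqsubset m$ becomes the proper-subset test `w < m`; the fact names are placeholders.

```python
def minimal_witness_basis(basis):
    """Keep the witnesses that have no strictly smaller witness in the basis."""
    return {m for m in basis if not any(w < m for w in basis)}


# Hypothetical witness basis: the second witness strictly contains the first.
a = frozenset({"fact1", "fact2"})
b = frozenset({"fact1", "fact2", "fact3"})
c = frozenset({"fact4"})
minimal = minimal_witness_basis({a, b, c})
```

Here `b` is discarded because `a` is a proper subset of it, while `a` and `c` remain.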

Theorem 3. (Invariance of Minimal Witness Basis under Equivalent
Queries) If Q and Q' are two equivalent well-formed queries with only equality
conditions and t is contained in Q(D) and Q'(D), then $M_{Q,D}(t) = M_{Q',D}(t)$.

The proof of this theorem is based on a homomorphism theorem which shows
that for the class of well-formed and well-defined queries (with equality
conditions), query containment is equivalent to the existence of a homomorphism
between the queries. Based on the ideas in [11], we can also extend this theorem
to certain subclasses of queries with inequalities. Thus the invariance property
of minimal witness basis in fact holds across this larger class of queries.

5.3 Cascaded Witnesses (Query Composition) 

Suppose we have some data sources - a mixture of materialized views (V) and
actual databases (D) - and a query written against these sources. We may choose
to find the witness basis for a value with respect to these sources (our witnesses
will therefore consist of values from both V and D) and subsequently find
the witness basis of those components taken from the views, so that eventually
the witnesses in the witness basis consist only of values from D. We show next that
the witness basis obtained in this manner is the same as the one obtained by first
“composing out” the views in the query using the composition rule in our rewrite
system R and then computing the witness basis according to the rewritten query.
In fact, this result is an important special case of Lemma 1 where views are nested
queries not sharing any variables with the outer query block.

Theorem 4. (Unnesting of Witnesses) Let D be a set of databases, V be a
query written against D and Q be a query written against D and V. Then for
a value t in Q(D, V), $W_{Q',D}(t) = \{w \sqcup w' \mid (w \sqcup v') \in W_{Q,\{D,V(D)\}}(t),\ v' \text{ is the
value taken from view } V(D),\ w' \in W_{V,D}(v')\}$, where Q' is the rewritten query
via our rewrite system R in which view V has been “composed out”.

6 Where-Provenance 

So far we have explored the issue of what pieces of input data validate the 
existence of an output value, for a given query. We now focus on identifying 
what pieces of input data helped create various values that appear in the output. 
The where-provenance of a specific value in the output is closely connected to 
the witnesses for the output in that only some parts of any witness are used 
to construct a specific output value. For instance, in the example described 
in Section 4, the output value “1685” in {name:"G.F. Handel"}.born:1685
depends only on Composers.{name:"G.F. Handel"}.born:1685 in the input.
We refer to the path Composers.{name:"G.F. Handel"}.born in the input as
the where-provenance of this output value. This informal description already 
suggests an intuitive procedure for determining the where-provenance of any 
specific value in the output: determine which output variable was bound to this 
specific value, and then identify the pieces of input data that were bound to this 
output variable. However, this intuition is fragile and there are many difficulties 
involved in formalizing this intuition as illustrated by the sequence of examples 
below. Consider the following two equivalent queries that look for employees
with a salary of $50K:

Q1 = where   Emps.{Id:x}.salary:$50K ∈ D
     collect {Id:x}.salary:$50K

Q2 = where   Emps.{Id:x}.salary:y ∈ D,
             y = $50K
     collect {Id:x}.salary:y


Suppose we wish to determine the where-provenance of $50K in an output
tuple. In the case of query Q1, there is no variable in the collect clause which the
value $50K can be identified with. The where-provenance of this value in Q1
is the query itself, since the value is hard-wired into the query output. For Q2,
the output variable y can be associated with the value $50K and can be used
to identify what contributed to this value. By convention, we will consider the 
where-provenance of a specific value in the output to be defined only if it can 
be associated with one or more variables in the output expression of a query. 
Otherwise, we will ascribe the where-provenance of a value to the query itself. 
This example illustrates that the notion of where-provenance is hard to keep 
invariant over equivalent queries in general. 

Our next example shows that when multiple pieces of data may simulta- 
neously contribute to a specific value in the output, it may be difficult to identify 
all the pieces. 



Q3 = where   Emps.{Id:x}.salary:y ∈ D,
             Emps.{Id:x}.bonus:y ∈ D
     collect {Id:x}.new_salary:y

Q4 = where   Emps.{Id:x}.salary:y ∈ D,
             Emps.{Id:x}.salary:z ∈ D,
             Emps.{Id:x}.bonus:z ∈ D
     collect {Id:x}.new_salary:y



In the case of Q3, the value associated with any new_salary component in the
output originated from both the salary and bonus components of the corresponding
employee. This is easily identified by tracking the output variable y through the
query. But in Q4, which is equivalent to Q3 on any input data where salary and
bonus are atomic values, one needs to recognize that z is always forced to agree
with y and hence where-provenance is determined by y and z together. This
suggests that in general the syntactic structure of a query may not suffice for
identifying the where-provenance. Even in cases where syntactic analysis alone
may work, this issue becomes rather difficult to handle once we consider nested
queries. Consider the following two equivalent queries:

Q5 = where   R.x.y:z ∈ D,
             S.x.y:z ∈ D
     collect x.y:z

Q6 = where   R.x.y:z ∈ D,
             S.t.u ∈ D,
             t:u ∈ (where   R.x.y:z ∈ D
                    collect x.y:z)
     collect {x.y:z, t:u}

When applied to an input database {R.1.2:3, S.1.2:3}, these queries
produce as output 1.2:3. The where-provenance of value 3 in the output is {R.1:2,
S.1:2} in the case of query Q5. In contrast, the where-provenance of the same value
with respect to Q6 requires one to identify that u binds to y:z via the nested query.
Then the where-provenance is given by {R.1:2, S.1:2} in this case as well.


A Syntactic Approach. The examples above highlight that for general queries,
where-provenance is not invariant over the space of equivalent queries, and that
a purely syntactic characterization of where-provenance is unlikely to yield a
complete description of the where-provenance. However, we use the syntactic
approach and identify a restricted class of queries, referred to as traceable queries,
for which where-provenance is preserved under rewriting. Our approach is based
on formalizing our initial intuition of using variables in the output expression of
a query as a means of identifying the where-provenance of a value. Specifically,
for each successful valuation of the query, we systematically explore the pieces
of input data contributing to the identified output variable; we refer to this
as the derivation basis of the output value. To determine the where-provenance of a
value resulting from a traceable query, it suffices to work with the normal form
of the query. Once a query is in normal form, a straightforward procedure can
be used to compute the derivation basis of a given value.

Paths. To identify the where-provenance of a value in our tree of values, we need 
to extend our notion of paths. We augment our syntax for paths with For 
example, to refer to the value {name : " J . S . Bach"} which is a value on the edge 
of Composers relation, we could use the path Composers . (name : " J . S . Bach"}7„. 
To refer to the value "J.S. Bach", we could use the path Composers. 
{name:"J.S. Bach"}7,name. 

We show next the definition of derivation basis (where-provenance) for
queries in normal form. Informally, the derivation basis for l:v finds a variable x in
the output expression that will generate v. This can be done by partially
matching l:v against the output expression e. All the paths to x in the patterns of Q
are then determined. Then, for any valuation that satisfies the “where” clause,
the valuation of the patterns in the “where” clause will form the witness, and the
valuation of the paths that point to x will be the where-provenance of l:v with
respect to this witness. Altogether, they form the derivation basis of l:v. We refer
to the procedure that computes the derivation basis of l:v as Where(l:v, Q, D).
It is similar to Why(t, Q, D) in that we generate a query which, when applied to
D, will produce the derivation basis. The “where” clause of the generated query
is the “where” clause of Q, and the “collect” clause of the generated query emits
two things: the patterns and the paths pointing to x in the “where” clause of Q.

Definition 8. (Derivation Basis) Consider a normal form query Q. The
derivation basis for l:v where v is an atomic value, denoted as $\Gamma_{Q,D}(l{:}v)$ with
respect to Q and D, is defined as below:

(1) If $Q = Q_1 \cup \dots \cup Q_n$ then $\Gamma_{Q,D}(l{:}v) = \Gamma_{Q_1,D}(l{:}v) \cup \dots \cup \Gamma_{Q_n,D}(l{:}v)$.

(2) If Q has the form $\{e \mid p_0 \in e_0, \dots, p_n \in e_n, \mathit{condition}\}$, let $\Psi$ be the set of
valuations on the variables of Q such that the “where” clause of Q holds
under each valuation and $\psi(e)$ contains l:v. For each $\psi \in \Psi$, let $p_{x_\psi}$ denote
the path in e that points to a variable $x_\psi$ such that there exist p' and p''
so that $l = p'.p''$ and $\psi(p_{x_\psi}) = p'$ and $\psi(x_\psi)(p'') = v$. Then $\Gamma_{Q,D}(l{:}v) =
\{([\![p_0]\!]_\psi \sqcup \dots \sqcup [\![p_n]\!]_\psi, S) \mid \psi \in \Psi,\ S = \{\psi(p').p'' \mid p' \text{ is the path that points to
variable } x_\psi \text{ in pattern } p_i,\ 0 \le i \le n\}\}$.

(3) Otherwise, $\Gamma_{Q,D}(l{:}v) = \{\}$.

More generally, the derivation basis of l:v where v is a compound value is
defined to be the derivation basis of all possible (path, value) pairs p':v' such that
p':v' points to a value in v. The derivation basis for multiple (path, value) pairs is
defined to be the product of the derivation bases of the individual (path, value) pairs.
That is, $\Gamma_{Q,D}(p_1{:}v_1, p_2{:}v_2) = \Gamma_{Q,D}(p_1{:}v_1) * \Gamma_{Q,D}(p_2{:}v_2) = \{(w_1 \sqcup w_2, P_1 \cup P_2) \mid
(w_1, P_1) \in \Gamma_{Q,D}(p_1{:}v_1),\ (w_2, P_2) \in \Gamma_{Q,D}(p_2{:}v_2)\}$. □
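
The derivation basis for an atomic output value can be sketched along the same lines as the witness basis. The code below is a simplified illustration, not the paper's Where procedure: it assumes flat tuple facts, `?`-prefixed variables, no conditions, and a path rendered as the dot-joined prefix of a pattern up to the variable. It returns `None` when the output position holds a constant, mirroring the convention that such a value is ascribed to the query itself. All names are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    if len(pattern) != len(fact):
        return None
    new = dict(binding)
    for term, value in zip(pattern, fact):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def valuations(patterns, db, binding=None):
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for fact in db:
        new = match(patterns[0], fact, binding)
        if new is not None:
            yield from valuations(patterns[1:], db, new)


def inst(pattern, psi):
    return tuple(psi[t] if is_var(t) else t for t in pattern)


def derivation_basis(patterns, output, db, t, pos):
    """Pairs (witness, paths) for the value at position pos of output tuple t."""
    var = output[pos]
    if not is_var(var):
        return None  # value is hard-wired: provenance is the query itself
    basis = set()
    for psi in valuations(patterns, db):
        if inst(output, psi) != t:
            continue
        witness = frozenset(inst(p, psi) for p in patterns)
        # every path in the "where" clause that points to the output variable
        paths = frozenset(
            ".".join(str(v) for v in inst(p, psi)[:p.index(var)])
            for p in patterns if var in p
        )
        basis.add((witness, paths))
    return basis


DB = [("Emps", "e1", "salary", "$50K"), ("Emps", "e1", "bonus", "$50K")]
Q3 = [("Emps", "?x", "salary", "?y"), ("Emps", "?x", "bonus", "?y")]
OUT3 = ("?x", "new_salary", "?y")
basis_q3 = derivation_basis(Q3, OUT3, DB, ("e1", "new_salary", "$50K"), 2)
```

For the flat encoding of Q3 above, the single entry of `basis_q3` pairs the witness (both Emps facts) with the two paths `Emps.e1.salary` and `Emps.e1.bonus`, whereas an encoding of Q1, whose output holds the constant `$50K`, yields `None`.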

We omit the definition for queries in the general form and remark that the 
main difference is that it looks for the derivation basis inductively for patterns
matched against nested queries. We show next that in dealing with the derivation 
basis for the class of traceable queries, we can restrict our attention to the 
derivation basis corresponding to their normal forms. 

Definition 9. (Traceable Queries) A well-defined query Q is traceable if (a) 
each pattern in the query matches either against some database constant or 
against a subquery, (b) every subquery in Q is a view which does not share any 
variables with the outer scope (c) only a singular pattern is allowed to match 
against a subquery and (d) this pattern and output expression of the subquery 
consist of a sequence of distinct variables (variables do not repeat) and have the 
same length. □ 

Example: The first query below is not traceable because the variable u is being 
used in the inner query (this violates condition (b)). The second query is not 
traceable because an expression {y.w} is used in the pattern sequence (this vio- 
lates condition (d) where each expression in the sequence can only be a variable). 



where   x:u ∈ D,
        v:z ∈ (where   u:v ∈ D
               collect u:v)
collect x:z

where   x:y ∈ D,
        {y:w}:z ∈ (where   u:v ∈ D
                   collect u:v)
collect x:y


Proposition 1. If Q is a traceable query and Q is rewritten to Q' via rewrite system R,
then Q' is a traceable query.

Proposition 2. For the class of traceable queries, if Q is rewritten to Q' via rewrite
system R, then for any l:v in the output of Q(D), $\Gamma_{Q,D}(l{:}v) = \Gamma_{Q',D}(l{:}v)$.



7 Conclusions 

We have described a framework for both describing and understanding pro- 
venance of data in the context of SPJU queries and views. Data provenance 
is examined from two perspectives, namely (1) Why is a piece of data in the 
output?, and (2) Where did a piece of data come from? 

We have taken a syntactic approach to understanding both notions of prove- 
nance, and we have described a system of rewrite rules in which why-provenance 
is preserved over the class of well-defined queries and where-provenance is pre- 
served over the class of traceable queries. 

One interesting direction for future work is to identify necessary and sufficient
conditions for the class of well-defined queries. Another interesting direction
is to study how additional constraints on the input instances, e.g., functional
dependencies, can help us obtain a more complete description of the
where-provenance of a piece of data.

Abstract. XML data is often used (validated, stored, queried, etc.) with
respect to different types. Understanding the relationship between these 
types can provide important information for manipulating this data. We 
propose a notion of subsumption for XML to capture such relations- 
hips. Subsumption relies on a syntactic mapping between types, and can 
be used for facilitating validation and query processing. We study the 
properties of subsumption, in particular the notion of the greatest lo- 
wer bound of two schemas, and show how this can be used as a guide 
for selecting a storage structure. While less powerful than inclusion, sub- 
sumption generalizes several other mechanisms for reusing types, notably 
extension and refinement from XML Schema, and subtyping. 

1 Introduction 

XML [5] is a data format for Web applications. As opposed to e.g., relational 
databases, XML documents do not have to be created and used with respect 
to a fixed, existing schema. This is particularly useful in Web applications, for 
simplifying exchange of documents and for dealing with semistructured data. 
But the lack of typing has many drawbacks, inspiring many proposals [2,3,4,10, 
12,23,24,33] of type systems for XML. The main challenge in this context is to 
design a typing scheme that retains the portability and flexibility of untyped 
XML. To achieve this goal, the above proposals depart from traditional typing 
frameworks in a number of ways. First, in order to deal with both structured 
and semistructured data, they support very powerful primitives, such as regular 
expressions [2,10,26,33,28] and predicate languages to describe atomic values [2, 
6,10]. Secondly, documents remain independent from their type, which allows the 
same document to be typed in multiple ways according to various application 
needs. These features result in additional complexity: the fact that data is often 
used with respect to different types, means that it is difficult to recover the 
traditional advantages (such as safety and performance enhancements) that one 
expects from type systems. To get these advantages back, one needs to understand
how types of the same document relate to each other.

In this paper, we propose a notion of subsumption to capture the
relationship between XML types. Intuitively, subsumption captures not just the fact
that one type is contained in another, but also captures some of the structural
relationships between the two schemas. We show that subsumption can be used
to facilitate commonly used type-related operations on XML data, such as type
assignment, or for query processing.

We compare subsumption with several other mechanisms aimed at reusing
types. Subsumption is less powerful than inclusion, but it captures refinement
and extension, recently introduced by XML Schema [33], subtyping, as in
traditional type systems, as well as the instantiation mechanism of [10,32]. As a
consequence, subsumption provides some formal foundations to these notions, 
and techniques to take advantage of them. 

We study the lattice theoretic properties of subsumption. These provide tech- 
niques to rewrite inclusion into subsumption. Notably we show the existence of 
a greatest lower bound. Greatest lower bound captures the information from se- 
veral schemas, while preserving the relationship with them, and can be used as 
the basis for storage design. 

Practical scenario. To further motivate the need for a subsumption mecha- 
nism for XML, consider the following application scenario. In order to run an 
integrated shopping site for some useful product, such as mobile phone jammers, 
company “A” accesses catalogs from various sources. The first catalog, on the 
left below, is taken from company “SESP” [22], while the second, on the right, 
is extracted from miscellaneous pages.

<products>
  <jammer>
    <company>SESP</company>
    <name>VHP Jammer</name>
    <price><onrequest/></price>
    <case><type>Mobile Attache Case</type>
    </case></jammer>
  <jammer>
    <company>SESP</company>
    <name>Full Milspec. Portable
          High Power (HP) Jammer</name>
    <price><onrequest/></price>
    <case><type>Rugged military
          type case</type></case>
    <booster><range>1km</range></booster>
    <supplement>39</supplement></jammer>

<products>
  <jammer>
    <name>Static HP Jammer</name>
    <price><onrequest/></price>
    <case><type>metal</type>
      <size>180x180x80mm
      </size></case></jammer>
  <jammer>
    <company>JamLogic</company>
    <name>Personal Jammer</name>
    <price><onrequest/></price>
    <input>Digital/Analog</input>
    <warranty>2 years</warranty>
  </jammer>
  <jammer>
    <name>Cell-Phone Jammer</name>
    <price>749</price></jammer>



Company “SESP” only sells high power jammers, and provides precise
information about their products as the SESP schema, given on the left-hand side
below^. This schema indicates that the SESPCatalog (we write types in upper
case and element names in lower case) is composed of an element with name
products, which has 0 or more children of type HPJammer ('*' stands for the
Kleene star). HPJammers have a company sub-element which is always "SESP",
a name, etc., and may have a booster option with a supplement cost. On the
right-hand side is the schema used by company “A”. Because it accesses jammer
information from many places, it supports a more general description where
Jammers might not have company information, and may have any kind of
Option, with or without a supplement.

^ Note that we will write some of the examples using the concrete schema syntax
developed for the YaT System [10].



SESPCatalog := products *HPJammer;

HPJammer :=
  jammer [ company [ "SESP" ],
           name [ String ],
           price [ Int | onrequest ],
           case [ type [ String ],
                  ?size [ String ] ],
           ?(booster [ range [ Int ] ],
             supplement [ Int ]) ];


IntegratedCatalog := products *Jammer;

Jammer :=
  jammer [ ?company [ String ],
           name [ String ],
           price [ Int | onrequest ],
           *(Option,
             ?supplement [ Int ]) ];

Option := Symbol *Any;



Because it knows precisely the type of its data, company SESP can support
more efficient storage (using, for instance, techniques in [14,18,31]), with fast
access to the name, price and case information. But the fact that company “A”
assumes a different type for the same data results in a mismatch. Verifying that
type SESPCatalog is included in type IntegratedCatalog allows company “A”
to make sure the information provided by SESP will conform to the structure
expected by the application. However, this will not help in performing further
operations, such as actually assigning types of the integrated schema to elements
of the SESP document, or understanding that the name and price elements
can be efficiently accessed using the storage used by company SESP. Doing so
requires understanding that the name and price in the Jammer type are related
to the name and the price elements in the HPJammer type. We shall see that
subsumption allows one to understand this relationship and to take advantage
of it.

Another important use of typing is to support better query processing. To
find all jammers that have a two-year warranty, one can write the following
YaTL [10,16,11] query:

define q($x) = make $n
  match $x with products/jammer/{ name/$n,
                                  warranty/$w }
  where contains($w, "2 years");

whose input type is:

q_type := products * jammer [ *(Name | Warranty | Other) ];
Name := name * Any1;
Warranty := warranty * Any2;
Other := !name !warranty * Any;
Any1 := true [ Any* ]; Any2 := true [ Any* ]

where ! stands for tag negation, i.e., any tag other than name and warranty.
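
For readers unfamiliar with YaTL, the intent of the query can be approximated with Python's standard `xml.etree.ElementTree` module. This is only a sketch of the query's meaning, not of YaTL's semantics or of the typing discussed here; the closing `</products>` tag is added so that the sample parses, and the function name `q` simply mirrors the query name.

```python
import xml.etree.ElementTree as ET

# Sample document based on the second catalog shown earlier; the closing
# </products> tag is an addition needed to make the fragment well-formed.
CATALOG = """
<products>
  <jammer>
    <name>Static HP Jammer</name>
    <price><onrequest/></price>
    <case><type>metal</type><size>180x180x80mm</size></case>
  </jammer>
  <jammer>
    <company>JamLogic</company>
    <name>Personal Jammer</name>
    <price><onrequest/></price>
    <input>Digital/Analog</input>
    <warranty>2 years</warranty>
  </jammer>
  <jammer>
    <name>Cell-Phone Jammer</name>
    <price>749</price>
  </jammer>
</products>
"""


def q(products):
    """Names of jammers that have a warranty element containing '2 years'."""
    names = []
    for jammer in products.findall("jammer"):
        name = jammer.findtext("name")
        warranty = jammer.findtext("warranty")
        if name is not None and warranty is not None and "2 years" in warranty:
            names.append(name.strip())
    return names


result = q(ET.fromstring(CATALOG))
```

Only the JamLogic entry carries a warranty element, so it is the only jammer selected; jammers without a warranty are skipped, just as they fail to match the YaTL pattern.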

Company “A” might wish to support queries on all Jammers, but more
efficient access for this query, i.e. for products with a warranty. The relational
approach [30] would be to use a specific access structure for the warranty field, 
but the integrated schema does not mention it. We will see that the greatest 




334 



G.M. Kuper and J. Simeon 



lower bound of the query type and the integrated schema is a new schema (with 
an explicit warranty field) that can be used for storage design, while the
relationships with the original schemas are preserved through subsumption.

Organization of the paper. Section 2 introduces the type system we will
use in the rest of the paper (essentially that of [2]) and the notion of type 
assignment. Section 3 defines subsumption, investigates its properties and its 
use for validation. Section 4 compares it to other relations on types, such as 
inclusion, refinement and extension in XML Schema, etc. Section 5 studies the 
greatest lower bound, the corresponding lattice, and how this can be used to 
bridge the gap between inclusion and subsumption. Section 6 discusses how one 
can take advantage of subsumption for storage and query processing. Section 7 
summarizes related works and indicates directions for future work. 

2 Data Model and Type System 

Data model. The data model, based on ordered labeled trees with references,
is similar to other previously proposed models [10,15,25,28]. $\mathcal{O}$ denotes a fixed
(infinite) set of object ids and $\mathcal{L}$ a fixed set of labels. References are modeled as
a special type of node that is labeled with a distinguished symbol & in $\mathcal{L}$ and
has exactly one child. The root of the database is treated specially: a database
is a tree with a root “λ”, which has no label and cannot be referenced by any
node. (The reason for the special treatment of the root is explained later.)

Definition 1. A database is a structure $D = (O_D, \mathit{label}_D, \mathit{children}_D)$, where

1. $O_D \subseteq \mathcal{O}$;

2. $\mathit{label}_D$ is a mapping from $O_D$ to $\mathcal{L}$;

3. $\mathit{children}_D$ is a mapping from $O_D \cup \{\lambda\}$ to $\bigcup_{j \ge 0} O_D^j$; if $\mathit{label}_D(o) = \&$, then
$\mathit{children}_D(o) \in O_D^1$;

4. The structure that we obtain by considering only children of non-reference
nodes (nodes with a label other than “&”) is a tree.



Example 1. The upper part of Figure 1 is a (partial) representation for the
Jammers document from Section 1 and would correspond to the following
structure: $D = (O_D, \mathit{label}_D, \mathit{children}_D)$, where $O_D = \{o_1, o_2, \dots\}$, $\mathit{children}(\lambda) =
[\dots, o_1, \dots]$, and

label(o_1) = jammer       children(o_1) = [o_11; o_12; o_13; o_14]
label(o_11) = company     children(o_11) = [o_111]
label(o_111) = “SESP”     children(o_111) = [ ]



Type system. We adopt the type system of [2,25], where predicates are used
to describe labels and regular expressions are used to describe children. Note,
though, that we do not handle unordered trees, and that we model references
in a slightly different way. Also, we choose not to use XML Schema [33], which
is more a user syntax for types than a model, but we will explain later on how
subsumption can be used in the context of XML Schema.

Fig. 1. Type assignment and subsumption mapping

Let $\mathcal{T}$ be a fixed, infinite set of type names, and $\mathcal{P}$ a fixed set of label
predicates, which is closed under disjunction, conjunction, and complementation.
We use $\tau$, $\tau'$, etc., to denote elements of $\mathcal{T}$. Regular expressions over $\mathcal{T}$ are of the
form $\epsilon$, $\tau$, $\&\tau$, $(R_1, R_2)$, $(R_1 \mid R_2)$, or $R_1^*$, where $R_1$ and $R_2$ are regular
expressions, and $\tau \in \mathcal{T}$. $L(R)$ denotes the language defined by the regular expression
$R$, in which $\&\tau$ is treated as a single symbol.

Definition 2. A type schema is a structure $S = (T_S, predicate_S, regexp_S)$, in
which

1. $T_S$ is a finite subset of $\mathcal{T}$;

2. $predicate_S$ is a mapping from $T_S$ to $\mathcal{P}$ with the property that for each $\tau$,
either $predicate_S(\tau) = \{\&\}$, or $\& \notin predicate_S(\tau)$; and

3. $regexp_S$ is a mapping from $T_S \cup \{\Delta\}$ to regular expressions over $T_S$.
Whenever $predicate_S(\tau) = \{\&\}$, $regexp_S(\tau)$ must be of the form $\tau_1 \mid \cdots \mid \tau_n$.

For convenience, we will sometimes describe schemas as $\tau \mapsto p; r$, where $p$
and $r$ are the predicate and regular expression corresponding to $\tau$. We write
$predicate(\tau) = true$ to mean that it is satisfied by all tags except "&". The
restrictions on the interaction between reference and non-reference types guarantee
that this will never cause any confusion.

Example 2. The middle part of Figure 1 is a (partial) representation of the
schema for HP jammers and would correspond to the structure
$(T_{cat}, predicate_{cat}, regexp_{cat})$, where $T_{cat}$ is the set
{catalog, HPjammer, $J_{11}$, $J_{12}$, $J_{13}$, $J_{14}$, $J_{111}$, ..., $J_{1421}$},
$regexp(\Delta)$ is catalog, and

$predicate_{cat}$(HPjammer) = {jammer}     $regexp_{cat}$(HPjammer) = $J_{11}, J_{12}, J_{13}, J_{14}$

$predicate_{cat}(J_{13})$ = {price}           $regexp_{cat}(J_{13}) = J_{131} \mid J_{132}$

$predicate_{cat}(J_{131})$ = {0, 1, ...}       $regexp_{cat}(J_{131}) = \epsilon$

$predicate_{cat}(J_{132})$ = {onrequest}      $regexp_{cat}(J_{132}) = \epsilon$



Typing and Type Assignment

Definition 3. Let $D$ be a database and $S$ a schema. We say $D$ is of type $S$
under the type assignment $\theta$, and write $D :_\theta S$, iff $\theta$ is a function from $O_D \cup \{\Delta\}$
to $T_S \cup \{\Delta\}$ such that:

1. $\theta(\Delta) = \Delta$,

2. for each $o \in O_D$, $predicate_S(\theta(o)) \ni label(o)$, and

3. for each $o \in O_D \cup \{\Delta\}$ with $children(o) = [o_1, \ldots, o_n]$,
$\theta(o_1) \ldots \theta(o_n) \in L(regexp_S(\theta(o)))$.

We say that $D$ is of type $S$, and write $D : S$, if $D :_\theta S$ for some $\theta$. $Models(S)$
is the set of databases of type $S$, i.e., $\{D \mid D : S\}$. It is immediate that $D :_\theta S$
and $D' \subseteq D$ (i.e., $O_{D'} \subseteq O_D$ and the corresponding labels and children are the
same) imply $D' : S$.
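The three conditions of Definition 3 can be checked mechanically once a candidate assignment is given. The following is a minimal sketch, not the authors' implementation: the tuple encoding of regular expressions, the name `ROOT` (standing for the unlabeled root), the use of Python callables for label predicates, and the small jammer fragment are all assumptions made for illustration. Word membership is decided with Brzozowski derivatives.

```python
# Regular expressions over type names, encoded as nested tuples (assumed encoding):
#   ("eps",)  ("sym", name)  ("cat", r1, r2)  ("alt", r1, r2)  ("star", r)
EPS, EMPTY = ("eps",), ("empty",)
ROOT = "ROOT"  # stands for the unlabeled root of both database and schema


def nullable(r):
    """True iff the empty word belongs to L(r)."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("sym", "empty"):
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # "alt"


def deriv(r, a):
    """Brzozowski derivative: a regex for the words w such that a.w is in L(r)."""
    tag = r[0]
    if tag in ("eps", "empty"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return ("alt", deriv(r[1], a), deriv(r[2], a))
    if tag == "star":
        return ("cat", deriv(r[1], a), r)
    head = ("cat", deriv(r[1], a), r[2])  # tag == "cat"
    return ("alt", head, deriv(r[2], a)) if nullable(r[1]) else head


def matches(r, word):
    """True iff the sequence of type names `word` belongs to L(r)."""
    for a in word:
        r = deriv(r, a)
    return nullable(r)


def is_type_assignment(labels, children, predicate, regexp, theta):
    """Check conditions 1-3 of Definition 3 for a candidate assignment theta."""
    if theta.get(ROOT) != ROOT:                       # condition 1
        return False
    for o, label in labels.items():                   # condition 2
        t = theta.get(o)
        if t not in predicate or not predicate[t](label):
            return False
    for o, kids in children.items():                  # condition 3
        if not matches(regexp[theta[o]], [theta[k] for k in kids]):
            return False
    return True


# A small fragment in the spirit of Examples 1 and 2 (contents are illustrative).
labels = {"o1": "jammer", "o11": "company", "o111": "SESP"}
children = {ROOT: ["o1"], "o1": ["o11"], "o11": ["o111"], "o111": []}
predicate = {
    "HPjammer": lambda l: l == "jammer",
    "J11": lambda l: l == "company",
    "J111": lambda l: True,
}
regexp = {
    ROOT: ("star", ("sym", "HPjammer")),
    "HPjammer": ("sym", "J11"),
    "J11": ("sym", "J111"),
    "J111": EPS,
}
theta = {ROOT: ROOT, "o1": "HPjammer", "o11": "J11", "o111": "J111"}
print(is_type_assignment(labels, children, predicate, regexp, theta))
```

Note that the sketch only verifies a given assignment; finding one when none is supplied is the harder validation problem.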

Example 3. Figure 1 illustrates the type assignment between the Jammer
document and the HPjammer schema, corresponding to the following $\theta$:

$\theta(o_1)$ = HPjammer     $\theta(o_{11}) = J_{11}$

$\theta(o_{13}) = J_{13}$          $\theta(o_{131}) = J_{132}$

$\theta(o_{14}) = J_{14}$          $\theta(o_{141}) = J_{141}$



Type assignment is the most important information coming out of the typing
process (also called validation in the XML world). Once computed, it allows the
system to efficiently obtain the type of a given piece of data whenever needed, e.g., in
order to choose the storage or make query processing decisions at run time. Note
that type assignment information is logically provided in the XML Query data
model [15] by the Def_T reference^.

However simple, our type system is powerful enough to capture most of the
other proposals, including XML Schema. It can be used to represent existing
type information from heterogeneous sources [10,32,2] or to describe mixes of
structured and semistructured data. The following two remarks will also play an
important role in the rest of the paper.

^ http://www.w3.org/TR/query-datamodel/#def_t

Remark 1. Any is the schema such that $D : Any$ holds for any database $D$:

$\Delta \mapsto (\tau_{anytype} \mid \tau_{anyref})^*$

$\tau_{anyref} \mapsto \{\&\};\ (\tau_{anyref} \mid \tau_{anytype})$

$\tau_{anytype} \mapsto true;\ (\tau_{anytype} \mid \tau_{anyref})^*$



Remark 2. For each database $D$, one can define a schema $S$ that types this
database only, by taking $T_S$ such that it contains exactly one type name $\tau$ for each
object $o$ in $O_D$, with $\theta(o) = \tau$, $predicate_S(\tau) = \{label_D(o)\}$ and
$regexp_S(\tau) = children_D(o)$. Then, $D :_\theta S$ and $Models(S) = \{D\}$.

We will write $S[D]$ for the schema that types the database $D$ only. We will call
None the schema that types the empty database only. None has $T_{None} = \emptyset$ and
$regexp_{None}(\Delta) = \epsilon$.
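The construction of Remark 2 is simple enough to write down directly. The sketch below is illustrative only: the naming scheme `t_<object id>`, the name `ROOT`, and the representation of each content model as a plain list of type names (a single word) are assumptions of this example. Each child object is replaced by its own type name when the content model is built.

```python
ROOT = "ROOT"  # stands for the unlabeled root


def schema_of(labels, children):
    """Build S[D]: one fresh type name per object, so that only D is typed."""
    theta = {o: f"t_{o}" for o in labels}   # type assignment theta(o) = t_o
    theta[ROOT] = ROOT
    # Each predicate accepts exactly the label of the corresponding object.
    predicate = {theta[o]: {label} for o, label in labels.items()}
    # Each content model is the single word formed by the children's type names.
    regexp = {theta[o]: [theta[k] for k in kids] for o, kids in children.items()}
    return theta, predicate, regexp


labels = {"o1": "jammer", "o11": "company", "o111": "SESP"}
children = {ROOT: ["o1"], "o1": ["o11"], "o11": ["o111"], "o111": []}
theta, predicate, regexp = schema_of(labels, children)
print(regexp[ROOT], regexp["t_o1"], predicate["t_o111"])
```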

3 Subsumption 

Intuitively, subsumption relies on a mapping between types (playing a role simi- 
lar to type assignment for typing) and on inclusion between regular expressions 
over these types. 

Definition 4. Let $S$ and $S'$ be two schemas. We say that schema $S$ subsumes
$S'$ under the subsumption mapping $\theta$, and write $S \preceq_\theta S'$, iff $\theta$ is a function
from $T_S \cup \{\Delta\}$ to $T_{S'} \cup \{\Delta\}$ such that:

1. $\theta(\tau) = \Delta$ iff $\tau = \Delta$.

2. For all $\tau \in T_S$, $predicate_S(\tau) \subseteq predicate_{S'}(\theta(\tau))$.

3. For all $\tau \in T_S \cup \{\Delta\}$, $\theta(L(regexp_S(\tau))) \subseteq L(regexp_{S'}(\theta(\tau)))$ (where $\theta$ is
extended to words in the language in the natural way).

We write $S \preceq S'$ if there exists a $\theta$ such that $S \preceq_\theta S'$, and $S \approx S'$ for
$(S \preceq S') \wedge (S' \preceq S)$: this is clearly an equivalence relation.
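Checking a given mapping against Definition 4 reduces to set inclusion on predicates and language inclusion on regular expressions. The sketch below is one possible way to do this, under assumptions that are not in the paper: predicates are finite sets of labels, regular expressions are nested tuples, `ROOT` stands for the root symbol, and inclusion is decided by exploring pairs of state sets built from Antimirov partial derivatives.

```python
EPS = ("eps",)
ROOT = "ROOT"  # stands for the root symbol of a schema


def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "sym":
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # "alt"


def pderiv(r, a):
    """Antimirov partial derivatives of r by symbol a (a finite set of regexes)."""
    tag = r[0]
    if tag == "eps":
        return set()
    if tag == "sym":
        return {EPS} if r[1] == a else set()
    if tag == "alt":
        return pderiv(r[1], a) | pderiv(r[2], a)
    if tag == "star":
        return {("cat", p, r) for p in pderiv(r[1], a)}
    out = {("cat", p, r[2]) for p in pderiv(r[1], a)}  # tag == "cat"
    return out | pderiv(r[2], a) if nullable(r[1]) else out


def symbols(r):
    if r[0] == "sym":
        return {r[1]}
    return set().union(*(symbols(x) for x in r[1:])) if len(r) > 1 else set()


def rename(r, theta):
    """Apply theta to every type name occurring in r."""
    if r[0] == "sym":
        return ("sym", theta[r[1]])
    return (r[0],) + tuple(rename(x, theta) for x in r[1:])


def included(r1, r2):
    """Decide whether L(r1) is a subset of L(r2)."""
    start = (frozenset([r1]), frozenset([r2]))
    seen, todo = {start}, [start]
    while todo:
        s1, s2 = todo.pop()
        if any(map(nullable, s1)) and not any(map(nullable, s2)):
            return False  # some word is accepted by r1 but not by r2
        for a in symbols(r1):
            n1 = frozenset(p for r in s1 for p in pderiv(r, a))
            if not n1:
                continue
            nxt = (n1, frozenset(p for r in s2 for p in pderiv(r, a)))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return True


def is_subsumption(S, S2, theta):
    """Check conditions 1-3 of Definition 4; a schema is a (predicate, regexp) pair."""
    pred, rx = S
    pred2, rx2 = S2
    if any((t == ROOT) != (theta[t] == ROOT) for t in rx):      # condition 1
        return False
    if any(not pred[t] <= pred2[theta[t]] for t in pred):        # condition 2
        return False
    return all(included(rename(rx[t], theta), rx2[theta[t]]) for t in rx)  # 3


# Illustrative schemas: S allows exactly an "a" leaf followed by a "b" leaf,
# S2 allows any number of leaves labeled "a" or "b".
S = ({"t1": {"a"}, "t2": {"b"}},
     {ROOT: ("cat", ("sym", "t1"), ("sym", "t2")), "t1": EPS, "t2": EPS})
S2 = ({"u": {"a", "b"}}, {ROOT: ("star", ("sym", "u")), "u": EPS})
theta = {ROOT: ROOT, "t1": "u", "t2": "u"}
print(is_subsumption(S, S2, theta))
```

The inclusion test terminates because a regular expression has only finitely many partial derivatives; in the worst case the number of explored pairs is exponential in the size of the expressions.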

Example 4. Figure 1 illustrates the subsumption mapping between the Jammer
and HPJammer types, corresponding to the following $\theta'$:

$\theta'$(HPJammer) = Jammer     $\theta'(J_{11}) = J'_{11}$

$\theta'(J_{111}) = J_{111}$            $\theta'(J_{13}) = J'_{13}$

$\theta'(J_{14})$ = Option           $\theta'(J_{141})$ = Any ...

The following propositions cover the elementary properties of subsumption. The
first states that type checking is a special case of subsumption, and is a direct
consequence of Remark 2. The second and third propositions state the
transitivity of subsumption and, more importantly, of the underlying subsumption
mappings, giving the means to propagate relationships between types.

Proposition 1. Let $S$, $S'$, $S''$ be three schemas, and $D$ be a database.

1. $D :_\theta S$ iff $S[D] \preceq_\theta S$.

2. $S \preceq_\theta S'$ and $S' \preceq_{\theta'} S''$ imply $S \preceq_{\theta' \circ \theta} S''$.

3. If $D :_\theta S$ and $S \preceq_{\theta'} S'$, then $D :_{\theta' \circ \theta} S'$.

Using subsumption for validation. An important consequence of Prop. 1 is
the ability to take advantage of subsumption for computing type assignments.
Intuitively, if one has a type assignment for a given database, and a subsumption
mapping from the original type to the new type, the new type assignment can
be obtained by composing the mappings rather than by evaluating the type
assignment from scratch.

This is especially useful since, in most practical scenarios, including the one we
sketched in Section 1, XML data is generated from a legacy source, along with
its original schema (SESPCatalog). If, instead of checking inclusion, company
"A" computes subsumption between the two schemas, it obtains the new type
assignment at the same time. This approach has a number of advantages. First,
the schema is typically orders of magnitude smaller than the data. Second, the
computation can be done at compile time, without requiring access to the data.

Example 5. For instance, assume Company "A" runs a query to the SESP
store that returns the jammer $o_1$. We know from $\theta$ in Example 3 that $o_1$ has
type HPjammer and from $\theta'$ in Example 4 that HPjammers correspond to
Jammers in the integrated schema. This gives us directly that the type of $o_1$
with respect to the integrated schema is Jammer (see also Figure 1).
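The composition used in Example 5 amounts to chaining two lookup tables. A minimal sketch follows, using dictionaries for both mappings; the entries are taken from Examples 3 and 4, while the dictionary representation and the names `compose` and `ROOT` are assumptions of this illustration.

```python
def compose(theta_prime, theta):
    """Return theta' o theta: type each object with theta, then map the type with theta'."""
    return {o: theta_prime[t] for o, t in theta.items()}


# Type assignment of Example 3 (partial) and subsumption mapping of Example 4 (partial).
theta = {"ROOT": "ROOT", "o1": "HPjammer", "o11": "J11", "o13": "J13"}
theta_prime = {"ROOT": "ROOT", "HPjammer": "Jammer", "J11": "J'11", "J13": "J'13"}

new_assignment = compose(theta_prime, theta)
print(new_assignment["o1"])
```

The cost is one dictionary lookup per object, and the data itself (labels and children) is never inspected.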

4 Comparison with Inclusion, Extension, et al. 

To get a better understanding of the scope of subsumption, we now compare it
to other relations over types, notably inclusion, XML Schema's mechanisms of
refinement and extension, subtyping, and the instantiation mechanism of [10].

Inclusion. Type inclusion is defined in terms of containment of models.

Definition 5. $S \subseteq S'$ iff $Models(S) \subseteq Models(S')$.

Of course, subsumption provides additional information compared to inclusion
because of the subsumption mapping. A natural question is: can one always find
a subsumption mapping between two types for which inclusion holds?

Proposition 2. Let S and S' be two schemas. Then (1) S S' ^ S C S' , but 
not conversely; and (2) S < S' ^ S Q S' , and this implication is proper. 

Proof. (2) is trivial. (1) and (3) are direct consequences of Remark 2. To see why 
the implications are proper, consider the following type schemas: 

$S, S'$:  $\tau_1 \mapsto \{a\}; \epsilon$
         $\tau_2 \mapsto \{a\}; \epsilon$
$S$:      $\Delta \mapsto \tau_1^*, \tau_2$
$S'$:     $\Delta \mapsto \tau_1, \tau_2^*$

Then both $S$ and $S'$ type precisely those databases for which $children(\Delta)$ are all
leaves with tag "a", but neither $S \preceq S'$ nor $S' \preceq S$.

As shown in [20,21], type inclusion can be used to type-check XML languages.
Proposition 2 implies that some queries might type-check even though a
subsumption mapping does not exist. In such a case one might not be able to take
advantage of subsumption. Fortunately, we will see that there are many
practical cases for which a subsumption mapping between types exists, including
when they are defined through XML Schema's refinement or extension
mechanisms or when they are exported from a traditional type system with subtyping.
Moreover, we will show (Corollary 1) that if $S \subseteq S'$, then one can construct
a schema $S''$ equivalent to $S$ for which $S'' \preceq S'$.

Extension and refinement in XML Schema. XML Schema: Part 1 [33]
defines two subtyping-like mechanisms, called extension and refinement, aimed
at reusing types. Because of space limitations, we cannot explain all the complex
features of XML Schema, so our presentation relies on a simple modeling of
these two mechanisms. In a nutshell, extension allows new fields to be added at the
end of a given type, while refinement provides syntactic means to restrict the
domain of a given type.

Example 6. The following XML Schema declaration defines a Stated-Address
by refining an Address to always have a unique state element and US-Address
by extending Stated-Address with a new zip element.

<complexType name="Address">
  <element name="street" type="string"/>
  <element name="city" type="string"/>
  <element name="state" type="string" minOccurs="0" maxOccurs="1"/>
</complexType>

<complexType name="Stated-Address" base="Address"
             derivedBy="refinement">
  <element name="street" type="string"/>
  <element name="city" type="string"/>
  <element name="state" type="string" minOccurs="1" maxOccurs="1"/>
</complexType>

<complexType name="US-Address" base="Stated-Address"
             derivedBy="extension">
  <element name="zip" type="positiveInteger"/>
</complexType>

In our model, these three types would be defined as follows:

regexp(Address) = Street, City, State?, $\tau_{anytype}^*$
regexp(Stated-Address) = Street, City, State, $\tau_{anytype}^*$
regexp(US-Address) = Street, City, State, Zip, $\tau_{anytype}^*$

The type $\tau_{anytype}$, as defined in Remark 1, indicates the ability to have
additional fields. Note that the subsumption relationship holds:
US-Address $\preceq$ Stated-Address $\preceq$ Address.

Proposition 3. A type $\tau'$ derived by extension or refinement from a type $\tau$ is
such that $\tau' \preceq \tau$.
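The chain of Example 6 can be spot-checked on sample child sequences. The sketch below is a sanity check on a handful of samples, not a proof of subsumption: it encodes each content model as a Python regular expression in which every type name is followed by one space, and it maps Zip to the any-type (written `Any` here). The encoding, the helper names, and the samples are assumptions of this illustration.

```python
import re

# Content models over type names; each type name is followed by one space.
ADDRESS = re.compile(r"Street City (State )?(Any )*")
STATED_ADDRESS = re.compile(r"Street City State (Any )*")
US_ADDRESS = re.compile(r"Street City State Zip (Any )*")


def encode(word):
    """Turn a list of type names into the string form used by the patterns."""
    return "".join(name + " " for name in word)


def image(word, theta):
    """Apply a mapping between type names to every position of a word."""
    return [theta[name] for name in word]


def holds_on_samples(source, target, theta, samples):
    """True if every sample in L(source) is mapped by theta into L(target)."""
    return all(
        target.fullmatch(encode(image(w, theta)))
        for w in samples
        if source.fullmatch(encode(w))
    )


identity = {n: n for n in ("Street", "City", "State", "Zip", "Any")}
zip_to_any = dict(identity, Zip="Any")  # the extra field is absorbed by the any-type

samples = [
    ["Street", "City"],
    ["Street", "City", "State"],
    ["Street", "City", "State", "Zip"],
    ["Street", "City", "State", "Zip", "Any"],
]

print(holds_on_samples(US_ADDRESS, STATED_ADDRESS, zip_to_any, samples))  # True
print(holds_on_samples(STATED_ADDRESS, ADDRESS, identity, samples))       # True
print(holds_on_samples(ADDRESS, STATED_ADDRESS, identity, samples))       # False
```

The last line fails on the sample without a state element, which is consistent with the relationship holding in one direction only.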

Proof Sketch: Extension corresponds to adding a field at the end of a given
type. This corresponds to regular expressions of the form
$regexp = \tau_1, \ldots, \tau_n, \tau_{anytype}^*$ and $regexp' = \tau_1, \ldots, \tau_n, \tau_{n+1}, \tau_{anytype}^*$,
for which subsumption holds with $\theta(\tau_{n+1}) = \tau_{anytype}$.

Refinement can be obtained by restricting a datatype, which yields inclusion
between predicates. minOccurs and maxOccurs restrictions correspond to regular
expressions of the form:

$$regexp = \underbrace{\tau, \ldots, \tau}_{n}, \underbrace{\tau?, \ldots, \tau?}_{m} \quad \text{and} \quad regexp' = \underbrace{\tau, \ldots, \tau}_{n'}, \underbrace{\tau?, \ldots, \tau?}_{m'}$$

Subsumption holds when $n \le n'$ and $(n + m) \ge (n' + m')$. Union type restrictions
correspond to regular expressions of the form $regexp = \tau_1 \mid \ldots \mid \tau_n \mid \ldots \mid \tau_{n+m}$ and
$regexp' = \tau_1 \mid \ldots \mid \tau_n$, for which subsumption holds. The result follows by induction.
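The occurrence condition in the proof sketch says that the interval of allowed repetitions of the derived type must lie inside that of the base type. A small sketch of this check follows; the function names and the pair encoding (n mandatory occurrences, m optional ones) are chosen for this illustration only.

```python
def occurrence_range(n, m):
    """Child counts allowed by n mandatory occurrences followed by m optional ones."""
    return set(range(n, n + m + 1))


def restricts(base, derived):
    """Criterion from the proof sketch: n <= n' and n + m >= n' + m'."""
    (n, m), (n2, m2) = base, derived
    return n <= n2 and n + m >= n2 + m2


# The state element of Example 6.
state_in_address = (0, 1)         # minOccurs=0, maxOccurs=1
state_in_stated_address = (1, 0)  # minOccurs=1, maxOccurs=1

print(restricts(state_in_address, state_in_stated_address))  # True
print(restricts(state_in_stated_address, state_in_address))  # False
```

The criterion agrees with plain set inclusion between the two occurrence ranges, which is how it can be cross-checked.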

Subtyping. The literature proposes a large number of different mechanisms
called or related to subtyping [8,27,29]. Basic subtyping usually relies on two
mechanisms: addition of new attributes in tuples (e.g., { name: String; age:
Int } <: { name: String }) and restrictions on atomic types (e.g., Int <:
Float). The latter mechanism is captured by predicate restrictions in our context,
while the former is captured by adding Any* types when modeling tuples^.

Instantiation. [10] proposes a notion of instantiation that corresponds to
certain restrictions over types. This mechanism allows restrictions on the label
predicates, restrictions on the arity of collections (similar to the minOccurs and
maxOccurs restrictions in XML Schema), and restrictions on unions. As for
XML Schema, these restrictions yield only types for which subsumption holds.

5 Greatest Lower and Least Upper Bound 

Let $S$ and $S'$ be two schemas. We consider equivalence classes of schemas with
respect to subsumption, $[S]_\approx$, ordered by $\preceq$, and show that this is a lattice. We
first define the greatest lower bound, which intuitively is a schema describing
the type information that is common to the given schemas.

We shall assume that whenever $\tau$ and $\tau'$ are in $\mathcal{T}$, so is the symbol $\tau \sqcap \tau'$.
We need to define intersection of regular expressions appropriately: our regular
expressions are over type names, but the intersection should be over the
semantics of the types, not the names. For example, if the regular expressions are $\tau_1^*$
and $(\tau_2, \tau_3)$, the intersection will be $((\tau_1 \sqcap \tau_2), (\tau_1 \sqcap \tau_3))$.



^ Note, however, that our type system does not capture the unordered semantics of
tuples.

Definition 6. Let $S$ and $S'$ be two type schemas.^ The greatest lower bound
$S \sqcap S'$ and least upper bound $S \sqcup S'$ are the schemas with
$T_{S \sqcap S'} = \{\tau \sqcap \tau' \mid \tau \in T_S, \tau' \in T_{S'}\}$, $T_{S \sqcup S'} = T_S \cup T_{S'}$, and

$S \sqcap S'$:  $\Delta \mapsto regexp_S(\Delta) \cap regexp_{S'}(\Delta)$

            $\tau \sqcap \tau' \mapsto predicate_S(\tau) \cap predicate_{S'}(\tau');\ regexp_S(\tau) \cap regexp_{S'}(\tau')$

$S \sqcup S'$:  $\Delta \mapsto regexp_S(\Delta) \mid regexp_{S'}(\Delta)$

            $\tau \mapsto predicate_S(\tau);\ regexp_S(\tau)$        $\tau \in T_S$

            $\tau \mapsto predicate_{S'}(\tau);\ regexp_{S'}(\tau)$      $\tau \in T_{S'}$
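The least upper bound is the simpler of the two constructions, since it only merges the two schemas and joins their root expressions with a union. A minimal sketch, assuming schemas are represented as a pair of dictionaries (predicates and regular expressions, the latter as nested tuples) with a `ROOT` key for the root symbol, and assuming the type names have already been renamed apart:

```python
ROOT = "ROOT"  # stands for the root symbol of a schema
EPS = ("eps",)


def least_upper_bound(S1, S2):
    """Union of two schemas with disjoint type names, following Definition 6."""
    pred1, rx1 = S1
    pred2, rx2 = S2
    if set(pred1) & set(pred2):
        raise ValueError("rename type names apart before taking the union")
    pred = {**pred1, **pred2}
    # Non-root types keep their own content models unchanged.
    rx = {t: r for t, r in {**rx1, **rx2}.items() if t != ROOT}
    # The root accepts whatever either root accepted.
    rx[ROOT] = ("alt", rx1[ROOT], rx2[ROOT])
    return pred, rx


S = ({"t1": {"a"}}, {ROOT: ("sym", "t1"), "t1": EPS})
S2 = ({"u1": {"b"}}, {ROOT: ("star", ("sym", "u1")), "u1": EPS})
pred, rx = least_upper_bound(S, S2)
print(rx[ROOT])
```

The greatest lower bound is not sketched here, as it requires a product construction over regular expressions.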

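Example 7. Consider the following two schemas (where $\tau_{anytype}$ is as in the
definition of the schema Any).

$S$:  $\Delta \mapsto (\tau_1, \tau_{anytype}^*)$          $S'$:  $\Delta \mapsto (\tau_{anytype}^*, \tau_2)$

     $\tau_1 \mapsto \{a\}; \epsilon$                       $\tau_2 \mapsto \{b\}; \epsilon$

$S \sqcap S'$:  $\Delta \mapsto ((\tau_1 \sqcap \tau_{anytype}), (\tau_{anytype} \sqcap \tau_{anytype})^*, (\tau_{anytype} \sqcap \tau_2))$

            $\tau_1 \sqcap \tau_{anytype} \mapsto \{a\}; \epsilon$

            $\tau_{anytype} \sqcap \tau_2 \mapsto \{b\}; \epsilon$

where $\tau_{anytype} \sqcap \tau_{anytype}$ is the same as $\tau_{anytype}$ up to renaming.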

The greatest lower bound of schemas requires intersection of regular
expressions, which can lead to a blowup in the size of the schema, although this is
unlikely to happen in practice.

The greatest lower bound is the best description, with respect to
subsumption, of all of the type information that we have about both schemas. In
particular, if a database is typed by both $S$ and $S'$, it is also typed by $S \sqcap S'$. More
generally:

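Proposition 4. 1. $S \sqcap S' \preceq S$ and $S \sqcap S' \preceq S'$; $S \preceq S \sqcup S'$ and $S' \preceq S \sqcup S'$.

2. If $S'' \preceq S$ and $S'' \preceq S'$, then $S'' \preceq S \sqcap S'$; similarly, if $S \preceq S''$ and $S' \preceq S''$,
then $S \sqcup S' \preceq S''$.

3. If $D : S$ and $D : S'$, then $D : S \sqcap S'$ and $D : S \sqcup S'$.

Theorem 1. $\mathcal{L} = ([S]_\approx, \sqcap_\approx, \sqcup_\approx, [None]_\approx, [Any]_\approx)$ is an incomplete distributive
lattice without complement.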

The next theorem is essential as it gives a relationship between the syntactic
definitions of $S \sqcap S'$ and $S \sqcup S'$ and the semantics of the respective schemas.
The proof of this theorem relies on Remark 2, which connects typing, on which
Models are defined, and subsumption.

Theorem 2. For any schemas $S$ and $S'$, (1) $Models(S \sqcap S') = Models(S) \cap
Models(S')$ and (2) $Models(S \sqcup S') = Models(S) \cup Models(S')$.

^ We assume for simplicity that $T_S$ and $T_{S'}$ are disjoint. This can always be achieved
by appropriate renaming.

The use of untagged roots was introduced in [2]. Our results give another,
technical, reason why such special treatment of the root is needed. Specifically, if the
database root were allowed to be tagged, then $\mathcal{L}$ would not be distributive. On
the other hand, a data model based on forests rather than trees would not work
either, as then $Models(S \sqcup S') = Models(S) \cup Models(S')$ would not hold.

Subsumption is more demanding than inclusion, as there are schemas that are
contained in other schemas without subsuming them. For this reason, the following
Corollary is very important: it shows that whenever a schema $S$ is contained in
a schema $S'$, $S$ can be rewritten into an equivalent schema that subsumes $S'$.

Corollary 1. Let $S$ and $S'$ be two schemas such that $Models(S) \subseteq Models(S')$.
Then there exists a schema $S''$ such that (1) $Models(S'') = Models(S)$ and
(2) $S'' \preceq S'$.

6 Practical Use of Subsumption 

We now come back to our example from the introduction and illustrate how
subsumption can be helpful for storage and query processing.

Standard relational techniques are used to design storage structures that take
into account which queries are likely to be asked. If we take query q from the
introduction, one might wish to find a schema S that allows the data to be stored
in such a way that this query can be answered efficiently. However, if one only
considers the integrated schema, one can only use the available information about
Jammers. Existing techniques [14,18,31] would provide the following relational
schema:

jammers(jid, company, name, price);
options(jid, att, treeid);
tree(treeid, ...);

where the tree table is used to store any tree, playing a similar role to the
overflow graph in [14].

The greatest lower bound can be used to derive a schema that includes the
warranty attribute. After appropriate renaming of types, this is:

Warranty_Jammer :=
    jammer [ TCompany', Name', Price',
             *( WarrantyOption' | (OtherOption', TSupplement') ) ];
Company' := company [ String ];
Name' := name [ String ];
Price' := price [ Int | onrequest ];
WarrantyOption' := warranty * Any;
OtherOption' := !warranty * Any;
Supplement' := supplement [ Int ];

We can then use this information to store the data with faster access to
the warranty attribute, using the following relational schema:

jammers(jid, company, name, price);
jammers(jid, warranty);
options(jid, att, supplement, treeid);
tree(treeid, ...);

We then need to evaluate query q on top of this storage. The key remark is
that YATL [10,11] uses pattern matching with type expressions. This captures
the navigation performed in other languages [1,13].

Following [9], the match clause of a YATL query is represented by a
pattern-matching operation called Bind. Bind matches a regular expression with the
data, and returns a binding between variables in the query and values in the
document. In the case of query q, this is Bind p[$n,$w], where



p[$n,$w] := products * Jammer;
Jammer   := jammer [ *(Name | Warranty | Other) ];
Name     := name * ($n:Any1);
Warranty := warranty * ($w:Any2);
Other    := !name !warranty * Any;
Any1     := true [ Any* ]; Any2 := true [ Any* ]



Most XML processors evaluate similar operations by loading the document
in memory and parsing it according to the given filter. This can be expensive
and does not make use of the knowledge of how the document is stored (here,
using the relational schema above).

Let θ' be the subsumption mapping from the greatest lower bound to the type
of p[$n,$w]:

θ'(Warranty_Jammer) = Jammer        θ'(Company') = Other

θ'(Name') = Name                    θ'(Price') = Other

θ'(WarrantyOption') = Warranty      θ'(OtherOption') = Other ...

Through θ', we know that the values of $n are the values of the elements of
type Name' in the stored schema, and hence how to access them using the
relational engine.

7 Related Work and Conclusion 



Typing for XML is a heavily studied problem. Existing work covers the type
systems themselves [2,10,12,33], type checking [20,26] and type inference [25,28].
XML types have been used for query formulation [19], query optimization [17,9],
storage [14,31], and compile-time error detection [20]. A notion of subsumption
for unordered semistructured data, based on graph bisimulation, was proposed
in [6]. Our work extends this approach to types that involve order and regular
expressions. Typing in XDuce [20] relies on full type inclusion. [7] describes a
notion of containment between XML DTDs, which are less expressive than our
type system; that notion is based on full inclusion with tag renaming.

There are many directions in which this work can be continued. First of all,
while our work (like most other work in this area) uses a list model for data, for
database applications a set semantics may be more appropriate, and therefore
extending the results to sets (and bags) would be of interest. For applying the
results to inheritance, as indicated above, one may want to be able to type an
object in multiple ways. Formally this may be captured by the greatest lower
bound, but this does not provide the intuitive semantics desired here.

We have not discussed complexity in this paper. Typing a database is a
special case of subsumption (where the database is itself the schema), and the
complexity of typing is known [2] to be hard. Note, however, that the complexity
of checking subsumption is in the size of the schema rather than in the size of
the database. Furthermore, many of the problems that relate to typing become
tractable in the case of unambiguous schemas: in our framework there are many
possible definitions of ambiguity, such as the existence of a single typing,
unambiguity up to reference nodes, unambiguous regular expressions, etc. Efficient
evaluation of queries is one of the main motivations for this work. Many complex
parameters must be taken into account in this context, such as the impact of
storage structures, memory management issues, etc. To evaluate the real impact
of subsumption, we are considering an implementation of the techniques presented
here in the context of the YAT System [10,9].

References 

1. S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener. The Lorel
query language for semistructured data. International Journal on Digital Libraries,
1(1):68-88, Apr. 1997.

2. C. Beeri and T. Milo. Schemas for integration and translation of structured and 
semi-structured data. In Proceedings of International Conference on Database 
Theory (ICDT), Lecture Notes in Computer Science, Jerusalem, Israel, Jan. 1999. 

3. R. Bourret, J. Cowan, I. Macherius, and S. St. Laurent. Document definition 
markup language (ddml) specification, version 1.0, Jan. 1999. W3C Note. 

4. T. Bray, C. Frankston, and A. Malhotra. Document content description for XML. 
Submission to the World Wide Web Consortium, July 1998. 

5. T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible markup language 
(XML) 1.0. W3C Recommendation, Feb. 1998. http://www.w3.org/TR/REC-xml/. 

6. P. Buneman, S. B. Davidson, M. F. Fernandez, and D. Suciu. Adding structure to 
unstructured data. In Proceedings of International Conference on Database Theory 
(ICDT), volume 1186 of LNCS, pages 336-350, Delphi, Greece, Jan. 1997. 

7. D. Calvanese, G. D. Giacomo, and M. Lenzerini. Representing and reasoning on 
xml documents: A description logic approach. Journal of Logic and Computation, 
9(3):205-318, 1999. 

8. L. Cardelli. A semantics of multiple inheritance. Information and Computation, 
76(2/3):138-164, 1988. 

9. V. Christophides, S. Cluet, and J. Simeon. On wrapping query languages and 
efficient XML integration. In SIGMOD’2000, Dallas, Texas, May 2000. 

10. S. Cluet, C. Delobel, J. Simeon, and K. Smaga. Your mediators need data
conversion! In SIGMOD'1998, pages 177-188, Seattle, Washington, June 1998.

11. S. Cluet and J. Simeon. YATL: a functional and declarative language for XML.
Draft manuscript. Mar. 2000.

12. A. Davidson, M. Fuchs, M. Hedin, M. Jain, J. Koistinen, C. Lloyd, M. Maloney, 
and K. Schwarzhof. Schema for object-oriented XML 2.0, July 1999. W3C Note. 

13. A. Deutsch, M. F. Fernandez, D. Florescu, A. Y. Levy, and D. Suciu. A query 
language for XML. In Proceedings of International World Wide Web Conference, 
Toronto, May 1999. 







14. A. Deutsch, M. F. Fernandez, and D. Suciu. Storing semistructured data with
STORED. In SIGMOD'1999, pages 431-442, Philadelphia, Pennsylvania, June
1999.

15. M. F. Fernandez and J. Robie. XML Query data model. W3C Working Draft,
May 2000. http://www.w3.org/TR/query-datamodel/.

16. M. F. Fernandez, J. Simeon, and P. Wadler (editors). XML query languages:
Experiences and exemplars, draft manuscript, communication to the W3C, Sept.
1999.

17. M. F. Fernandez and D. Suciu. Optimizing regular path expressions using graph
schemas. In ICDE'1998, Orlando, Florida, Feb. 1998.

18. M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT:
A system for extracting document type descriptors from XML documents. In
SIGMOD'2000, pages 165-176, Dallas, Texas, May 2000.

19. R. Goldman and J. Widom. Data guides: Enabling query formulation and
optimization in semistructured databases. In VLDB'1997, pages 436-445, Athens,
Greece, Aug. 1997.

20. H. Hosoya and B. C. Pierce. XDuce: an XML processing language. In International 
Workshop on the Web and Databases (WebDB’2000), Dallas, Texas, May 2000. 

21. H. Hosoya, J. Vouillon, and B. C. Pierce. Regular expression types for XML.
Submitted for publication, Mar. 2000.

22. http://sesp.co.uk/4.htm.

23. N. Klarlund, A. Moller, and M. I. Schwartzbach. DSD: A schema language for 
XML. In Workshop on Formal Methods in Software Practice, Portland, Oregon, 
Aug. 2000. 

24. M. Makoto. Tutorial: How to relax, http://www.xml.gr.jp/relax/. 

25. T. Milo and D. Suciu. Type inference for queries on semistructured data. In
PODS'1999, pages 215-226, Philadelphia, Pennsylvania, May 1999.

26. T. Milo, D. Suciu, and V. Vianu. Typechecking for XML transformers. In 
PODS’2000, Dallas, Texas, May 2000. 

27. J. C. Mitchell. Foundations for Programming Languages. MIT Press, 1996. 

28. Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In 
PODS’2000, Dallas, Texas, May 2000. 

29. F. Pottier. Synthèse de types en présence de sous-typage: de la théorie à la pratique.
Thèse de doctorat, Université Paris VII, July 1998.
http://pauillac.inria.fr/~fpottier/publis/these-fpottier.ps.gz.

30. R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill,
2000.

31. J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. 
Naughton. Relational databases for querying XML documents: Limitations and 
opportunities. In Proceedings of International Conference on Very Large Databa- 
ses (VLDB), Edinburgh, Scotland, Sept. 1999. 

32. J. Simeon and S. Cluet. Using YAT to build a web server. In International 
Workshop on the Web and Databases (WebDB’98), volume 1590 of LNCS, pages 
118-135, Valencia, Spain, Mar. 1998. 

33. H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML schema part 
1: Structures. W3C Working Draft, Feb. 2000. 




Towards Aggregated Answers for 
Semistructured Data 



Holger Meuss^, Klaus U. Schulz^, and François Bry^

^ CIS, University of Munich, Oettingenstr. 67, 80538 Munich, Germany
meuss@cis.uni-muenchen.de

^ Institute for Computer Science, University of Munich, Oettingenstr. 67, 
80538 Munich, Germany 



1 Introduction 

Semistructured data [5,34,23,31,1] are used to model data transferred on the
Web for applications such as e-commerce [18], biomolecular biology [8],
document management [2,21], linguistics [32], thesauri and ontologies [17]. They are
formalized as trees or more generally as (multi-)graphs [23,1]. Query languages
for semistructured data have been proposed [6,11,1,4,10] that, like SQL, can be
seen as involving a number of variables [35], but, in contrast to SQL, allow
the variables to be arranged in trees or graphs reflecting the structure of the
semistructured data to be retrieved. Leaving aside the "construct" parts of queries,
answers can be formalized as mappings represented as tuples, hence called
answer tuples, that assign database nodes to query variables. These answer tuples
underlie the semistructured data delivered as answers.

A simple enumeration of answer tuples is problematic for several reasons.
First, the number of answer tuples for a query may grow exponentially in the
size of both the query and the database. Second, even if the number of answer
tuples is manageable, the frequent sharing of common data between distinct
answer tuples is no longer apparent in their enumeration.

In this article, it is first argued that enumerating answer tuples is often not 
appropriate and that aggregated answers are preferable. Then, a notion of ag- 
gregated answers called Complete Answer Aggregate ( CAA ) generalizing [25,24] 
is introduced and algorithms for computing CAAs are given. We only consider 
CAAs for semistructured data: In this context, CAAs seem particularly attrac- 
tive since they reflect the graph structure of the database and the query in a 
very natural way. It is shown that CAAs enjoy nice complexity properties: (1) 
While the number of answer tuples may be exponential in the size of the query, 
the size of the CAA is at most linear in the size of the query and quadratic in 
the size of the database; (2) the complexity of computing the CAA of a query 
depends on the query’s structural complexity (i.e. whether it is a sequence, tree, 
graph, etc.) but is independent of the structural complexity of the database. For 
tree queries, efficient polynomial algorithms are given. Besides, CAAs seem to 
be particularly appropriate for answer searching and answer browsing. 

This article is organized as follows. The need for aggregated answers and the
basics of CAAs are illustrated with a motivating example in Section 2. Section 3
introduces a few preliminary notions. Section 4 gives the formal definition of
CAAs. In Section 5, a hierarchy of query problems of increasing complexity
is defined. In Section 6, we describe algorithms for computing the CAA and
analyze their complexity, for each problem of the hierarchy. Query answering
using CAAs is discussed in Section 8. Section 9 discusses related work. Section 10
is a conclusion. For space reasons, no proofs are given here; they can be found in
the full version of this paper [26].

2 Motivating Example 

Consider a database $\mathcal{D}$ (Figure 1) on research projects offering information on
project managers, members and publications. Assume that the database is
organized as a graph with labeled edges and nodes according to a model for
semistructured data [9,19,1]. Projects x and their managers z such that the string
"XML" occurs in a title element t (at any depth) of some article y of project
x are retrieved by the query Q depicted in Figure 1. (The restructuring facilities of
full-fledged query languages for semistructured data [22,12,36] are not considered
in this paper. Thus, only the "select" parts of queries are mentioned; possible
"construct" parts are left implicit.)

Fig. 1. An example query (left) and database (right)

Some of the articles retrieved by the query $Q$ are common to several projects
in the database $\mathcal{D}$ (Figure 1). Evaluated against the database $\mathcal{D}$, the query $Q$
returns the answer tuples
$(x \mapsto \&1, y \mapsto \&3, z \mapsto \text{“Mary”}, t \mapsto \text{“XML basics”})$ and
$(x \mapsto \&2, y \mapsto \&3, z \mapsto \text{“John”}, t \mapsto \text{“XML basics”})$.

More generally, if the string “XML” occurs in $k$ titles of article &3, then $Q$
admits $2k$ answer tuples all referring to &3. Furthermore, if an article is shared by
$n$ projects, then $Q$ has $n \cdot k$ answer tuples. In case of more complex queries and/or
of database items with more complex interconnections, as often arise in Web and
e-commerce servers, an enumeration of answer tuples à la Prolog or à la SQL
results in a combinatorial explosion. If, e.g., the functional dependency
Project $\to$ Manager does not hold and if each of the two projects has $m$ managers,
then $Q$ admits $2 \cdot k \cdot m$ answer tuples. In general, such a product giving the number
of answers of a query like $Q$ can have any number of factors, i.e. the number of
answers can be exponential in the size of the query.
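
The counting argument above can be checked on a toy instance. The following is only a minimal sketch: the function names and the synthetic node identifiers are hypothetical, and it assumes a single article shared by all projects, as in the running example. It contrasts the number of enumerated tuples with the number of distinct database items they mention.

```python
from itertools import product

def enumerate_answers(n_projects, k_titles, m_managers):
    """Enumerate (x, y, z, t) tuples for one article shared by all projects.

    Hypothetical toy data: every project has its own m managers, and the
    shared article has k titles containing the search string.
    """
    answers = []
    for p in range(n_projects):
        for mgr, title in product(range(m_managers), range(k_titles)):
            answers.append((f"project{p}", "article",
                            f"manager{p}.{mgr}", f"title{title}"))
    return answers

def distinct_items(n_projects, k_titles, m_managers):
    """Distinct bindings: projects + 1 article + all managers + titles."""
    return n_projects + 1 + n_projects * m_managers + k_titles

tuples = enumerate_answers(2, 3, 4)
print(len(tuples))              # number of tuples: n * k * m
print(distinct_items(2, 3, 4))  # number of distinct items mentioned
```

The tuple count grows as a product of the parameters, whereas the number of distinct items grows as a sum, which is the gap an aggregated answer is meant to exploit.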

Arguably, for many applications such an enumeration of answer tuples is not 
appropriate. Instead, a data structure stressing the common subelements shared 
between (parts of) answer tuples as well as their graph relationships would often 
be more convenient. Let us call such a hypothetical data structure an
aggregated answer. Aggregated answers can help to recognize “bottlenecks” in the
“answer space”. Furthermore, aggregated answers make advanced forms of query
answering possible, cf. Section 7.

In this paper, Complete Answer Aggregates (CAAs) are proposed as a
formalization of such a notion of aggregated answer, and CAAs are shown to be
efficiently computable. A CAA reflects the graph structure of the query it is
computed from. The CAA computed for $Q$ over an example database (larger
than $\mathcal{D}$) has a “slot”, represented in the figure below by a rectangle, for each
variable $x$, $y$, $z$, and $t$. A slot for variable $v$ contains possible binding elements
for $v$: e.g., the slot for $x$ contains project identifiers and the slot for $z$ contains
manager identifiers. The edges are CAA links. They represent (sequences of)
database edges. Note that a presentation of a CAA such as in the following figure
is not intended for end users.

Admittedly, it is possible in some cases to generate simple kinds of aggregated
answers with the usual query languages [4,10], but this requires nested queries
that might be complex [1] (cf. Sec. 4.1). In contrast, with CAAs, no complex
queries are needed and aggregated answers are obtained in all cases.

CAAs are nothing else than semistructured data items of a certain kind. 
Thus queries can be posed to CAAs. Provided that the implementation of the 
query language is “CAA aware”, such a querying can be performed without 
requiring from the user or application issuing the query to be aware of the 
internal structure of the CAAs constructed during query evaluation. Thus, CAAs 
are a convenient basis for an iterative, or cascade style query-answering: The 
evaluation of a query yields a CAA which can be stored and in turn queried 
using the same query language. Such a cascade style query-answering is often 
sought in e-commerce applications [18]. Cf. Section 7 for more details.

3 Preliminary Notions

Definition 1. A database is a tuple $\mathcal{D} = (N, E, L_N, L_E, \Lambda_N, \Lambda_E)$ where

• $N$ is a (finite) set of nodes,
• $E \subseteq N \times N$ is a set of directed edges,
• $L_N$ is a (finite) set of node labels,
• $L_E$ is a (finite) set of edge labels or features,
• $\Lambda_N : L_N \to 2^N$ is a (total) function assigning to each node label a set of nodes,
• $\Lambda_E : E \to L_E$ is a (total) function assigning labels to edges.

$\mathcal{D}$ is a sequence (resp. tree, DAG, graph) database if the structure imposed by
$E$ upon $N$ defines a (finite) set of sequences (resp. trees, DAGs, graphs).

For simplicity, we do not consider multigraphs with multiple edges between
nodes. Note that sequence, tree, and DAG databases are acyclic and that nodes
in sequence (resp. tree) databases have at most one parent and one child (resp.
one parent). Node labels are aimed at modeling attributes and attribute values.
Multiple node labeling is allowed, so as to model textual content: a label $w$
of node $n$ might express that a string $w$ occurs in the textual content of $n$. In
the sequel, regular expressions $\alpha$ over the alphabet $L_E$ of edge labels will be
considered.
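
As an illustration of Definition 1, a database could be held in memory roughly as follows. This is a sketch under assumptions, not the authors' data structure: the class name `Database`, its field names and the tiny example instance are all hypothetical, and nodes are assumed to be strings.

```python
from dataclasses import dataclass, field

@dataclass
class Database:
    """Hypothetical container mirroring Definition 1."""
    nodes: set
    edge_labels: dict                                 # (parent, child) -> label, i.e. E with Lambda_E
    node_labels: dict = field(default_factory=dict)   # label -> set of nodes, i.e. Lambda_N

    def parents(self, node):
        return sorted(p for (p, c) in self.edge_labels if c == node)

    def children(self, node):
        return sorted(c for (p, c) in self.edge_labels if p == node)

    def _has_cycle(self):
        # Repeatedly peel off nodes without incoming edges; a cycle exists
        # exactly when some nodes can never be peeled off.
        remaining = set(self.nodes)
        edges = set(self.edge_labels)
        while True:
            roots = {n for n in remaining
                     if not any(c == n for (_, c) in edges)}
            if not roots:
                return bool(remaining)
            remaining -= roots
            edges = {(p, c) for (p, c) in edges if p not in roots}

    def kind(self):
        """Classify the database as 'sequence', 'tree', 'DAG' or 'graph'."""
        if self._has_cycle():
            return "graph"
        if any(len(self.parents(n)) > 1 for n in self.nodes):
            return "DAG"
        if any(len(self.children(n)) > 1 for n in self.nodes):
            return "tree"
        return "sequence"

# Two projects sharing one article: the shared node has two parents.
db = Database(
    nodes={"&1", "&2", "&3", "t"},
    edge_labels={("&1", "&3"): "Article", ("&2", "&3"): "Article",
                 ("&3", "t"): "Title"},
    node_labels={"XML": {"t"}},
)
print(db.kind())
```

Storing $\Lambda_N$ as a map from labels to node sets follows the definition literally; an inverted map from nodes to labels would serve equally well for checking labeling constraints.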

Definition 2. Let $\mathcal{D} = (N, E, L_N, L_E, \Lambda_N, \Lambda_E)$ be a database and $\alpha$ a regular
expression over $L_E$. A node $d \in N$ is an $\alpha$-ancestor of $e \in N$ if there exists a
path from $d$ to $e$ such that the sequence of labels along the path belongs to the
regular language $\mathcal{L}(\alpha)$ induced by $\alpha$.
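
Definition 2 can be made concrete with a naive path enumeration. The sketch below is an illustration only and rests on several assumptions: the database is acyclic (so that only finitely many paths exist), edge labels are plain words, a path is encoded as its labels joined by `/`, and the regular expression is written in Python's `re` syntax over that encoding. The function name `alpha_ancestors` and the example edges are hypothetical.

```python
import re

def alpha_ancestors(edge_labels, node, pattern):
    """Return all d such that some path d -> ... -> node spells a word
    matching `pattern`.

    edge_labels: dict mapping (parent, child) to an edge label.
    Assumes an acyclic database; paths are enumerated backwards from `node`.
    """
    regex = re.compile(pattern)
    result = set()

    def walk(current, labels_below):
        for (parent, child), label in edge_labels.items():
            if child == current:
                word = [label] + labels_below
                if regex.fullmatch("/".join(word)):
                    result.add(parent)
                walk(parent, word)

    walk(node, [])
    return result

edges = {
    ("proj", "pubs"): "Publications",
    ("pubs", "art"): "Article",
    ("art", "chap"): "Chapter",
    ("chap", "title"): "Title",
}
# Paths that start with an Article edge and end with a Title edge.
print(alpha_ancestors(edges, "title", r"Article(/\w+)*/Title"))
# Any non-empty path: the ordinary (proper) ancestors.
print(alpha_ancestors(edges, "title", r"\w+(/\w+)*"))
```

This enumeration can take exponential time on DAGs with many paths; it only illustrates the definition and makes no attempt at an efficient computation.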

Definition 3. Let $X$ be an enumerable set of variables and
$\mathcal{D} = (N, E, L_N, L_E, \Lambda_N, \Lambda_E)$ a database. An atomic path constraint is an
expression of the form

• $A(x)$ called a labeling constraint,
• $x \to_1 y$ called a child constraint,
• $x \to_f y$ called an $f$-child constraint,
• $x \to_+ y$ called a descendant constraint,
• $x \to_\alpha y$ called an $\alpha$-descendant constraint,

where $x, y \in X$, $A \in L_N$, $f \in L_E$, and $\alpha$ is a regular expression over $L_E$. Atomic
path constraints of the latter four types are called edge constraints.

In the sequel, we only consider regular expressions $\alpha$ where the empty word
$\varepsilon$ does not belong to $\mathcal{L}(\alpha)$. Edge constraints of the form $x \to_1 y$, $x \to_f y$,
and $x \to_+ y$ can be seen as special cases of $\alpha$-descendant constraints. Without
loss of generality, the (non-atomic) path constraints considered in this paper
are conjunctions of atomic path constraints containing at most one atomic edge
constraint $x \to_? y$ for each (ordered) pair $(x, y)$ of query variables. Depending
on their nature, edge constraints might impose sequence, tree, DAG, or graph
structures on the variables.

Definition 4. A sequence (resp. tree, DAG, graph) query is a conjunction of
atomic path constraints the atomic edge constraints of which impose a sequence
(resp. tree, DAG, graph) structure on its variables. If $\mathcal{D}$ is a database and $Q$ a
query with set of variables $X_Q$, then $(Q, X_Q, \mathcal{D})$ is an evaluation problem.

Note that queries of either type can be represented as graphs.

Definition 5. Let $\mathcal{D} = (N, E, L_N, L_E, \Lambda_N, \Lambda_E)$ be a database and $Q$ a query.
An answer to $Q$ in $\mathcal{D}$ is a mapping $\mu$ that assigns a node in $\mathcal{D}$ to each variable
in $Q$ in such a way that

• $\mu(x) \in \Lambda_N(A)$ whenever $Q$ contains the labeling constraint $A(x)$,
• $(\mu(x), \mu(y)) \in E$ whenever $Q$ contains the child constraint $x \to_1 y$,
• $(\mu(x), \mu(y)) \in E$ and $\Lambda_E(\mu(x), \mu(y)) = f$ whenever $Q$ contains the $f$-child
  constraint $x \to_f y$,
• $\mu(x)$ is an $\alpha$-ancestor of $\mu(y)$ whenever $Q$ contains the $\alpha$-descendant constraint
  $x \to_\alpha y$.

If $(Q, X_Q, \mathcal{D})$ is an evaluation problem and $\mu$ an answer to $Q$ in $\mathcal{D}$, then $\mu$ is
called a solution of $(Q, X_Q, \mathcal{D})$.

Note that according to Definition 5, cyclic, i.e. proper graph queries have no 
answers in acyclic, i.e. sequence, tree, or DAG databases. 
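
The conditions of Definition 5 translate directly into a check on a candidate mapping. The following sketch covers simple queries only (child, $f$-child and descendant constraints); the encoding of constraints as tuples, the helper names and the example data are assumptions made for illustration.

```python
def has_path(edge_labels, d, e):
    """True if a non-empty path leads from d to e."""
    stack, seen = [d], set()
    while stack:
        cur = stack.pop()
        for (p, c) in edge_labels:
            if p == cur and c not in seen:
                if c == e:
                    return True
                seen.add(c)
                stack.append(c)
    return False

def is_answer(mu, labeling, constraints, edge_labels, node_labels):
    """Check whether the mapping mu (variable -> node) is an answer.

    labeling:    list of (label, variable) pairs
    constraints: list of (x, kind, y) with kind "child", "desc",
                 or ("f", label) for an f-child constraint
    """
    for label, x in labeling:
        if mu[x] not in node_labels.get(label, set()):
            return False
    for x, kind, y in constraints:
        pair = (mu[x], mu[y])
        if kind == "child":
            if pair not in edge_labels:
                return False
        elif kind == "desc":
            if not has_path(edge_labels, *pair):
                return False
        else:
            _, f = kind
            if edge_labels.get(pair) != f:
                return False
    return True

edge_labels = {("p1", "a3"): "Article", ("a3", "c"): "Chapter",
               ("c", "t"): "Title"}
node_labels = {"XML": {"t"}}
labeling = [("XML", "t")]
constraints = [("x", ("f", "Article"), "y"), ("y", "desc", "t")]
print(is_answer({"x": "p1", "y": "a3", "t": "t"},
                labeling, constraints, edge_labels, node_labels))
```

Enumerating all mappings and filtering them with such a check is the brute-force baseline that the aggregates of the next section are designed to avoid.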



4 Complete Answer Aggregates 

Let $\mathcal{D} = (N, E, L_N, L_E, \Lambda_N, \Lambda_E)$ be a database and $Q$ a query with set of
variables $X_Q$.

Definition 6. An answer aggregate for the evaluation problem $(Q, X_Q, \mathcal{D})$ is a
pair $(Dom, \Pi)$ such that

• $Dom : X_Q \to 2^N$ assigns to each variable of $Q$ a set of nodes of $\mathcal{D}$ such that
  each $d \in Dom(x)$ has the label $A$ if $A(x)$ is a labeling constraint of $Q$,
• $\Pi$ maps each edge constraint $x \to_1 y$ (resp. $x \to_f y$, $x \to_+ y$, $x \to_\alpha y$) of $Q$ to
  a set $\Pi(x,y) \subseteq Dom(x) \times Dom(y)$ such that for all $(d,e) \in \Pi(x,y)$, $e$ is a child
  (resp. $f$-child, descendant, $\alpha$-descendant) of $d$.

A node $d \in Dom(x)$ is called a target candidate for $x$. A pair $(d,e) \in \Pi(x,y)$ is
called a link between $d$ and $e$.

Definition 7. An instantiation of an answer aggregate $(Dom, \Pi)$ is a mapping
$\rho$ that maps each $x \in X_Q$ to a node $\rho(x) \in Dom(x)$ such that $(\rho(x), \rho(y)) \in
\Pi(x,y)$ whenever $Q$ contains an edge constraint $x \to_1 y$, $x \to_f y$, $x \to_+ y$, or
$x \to_\alpha y$. Each node of the form $\rho(x)$, as well as every link of the form $(\rho(x), \rho(y))$
(for $x, y \in X_Q$), is said to contribute to instantiation $\rho$.

Note that each instantiation of an answer aggregate for $(Q, X_Q, \mathcal{D})$ defines an
answer to $Q$ in $\mathcal{D}$.

Definition 8. $(Dom, \Pi)$ is a complete answer aggregate (CAA) for $(Q, X_Q, \mathcal{D})$
if every answer to $Q$ in $\mathcal{D}$ is an instantiation of $(Dom, \Pi)$ and if every target
candidate and every link of $(Dom, \Pi)$ contributes to at least one instantiation.

Example 1. Assume that $Q = x_1 \to_+ x_2 \wedge x_2 \to_+ x_3 \wedge \dots \wedge x_{q-1} \to_+ x_q$
and assume that $\mathcal{D}$ consists of $n \ge q$ nodes $d_1, \dots, d_n$ sequentially ordered (i.e.
$(d_i, d_{i+1}) \in E$ for all $1 \le i \le n-1$). $Q$ has $\binom{n}{q}$ answers in $\mathcal{D}$. For $q = 4$ and $n = 8$
the CAA representing the 70 answers (each mapping the four query variables to
database nodes) is depicted below. For one possible instantiation, contributing
nodes and links are highlighted.

Lemma 1. There exists a unique CAA for each evaluation problem.

Size of CAAs. Let $(Q, X_Q, \mathcal{D})$ be an evaluation problem. In the following, $q$
will denote the number of variables plus labeling constraints of $Q$, $n$ the number
of nodes of $\mathcal{D}$, and $a$ the maximal number of ancestors of a database node plus
the number of links between these nodes. Note that, if $\mathcal{D}$ is a sequence or a tree
database, then $a$ is bounded by $O(h)$ where $h$ is the height of $\mathcal{D}$.

As a measure for the size of a CAA $(Dom, \Pi)$ for $(Q, X_Q, \mathcal{D})$ it makes
sense to retain the total number of target candidates and links in $(Dom, \Pi)$,
i.e. $\sum_{x \in X_Q} |Dom(x)| + \sum_{x,y \in X_Q} |\Pi(x,y)|$ (where $\Pi(x,y) := \emptyset$ if $x$ and $y$ are
not related by an edge constraint in $Q$). Although the number of answers to a
query $Q$ may be exponential, the size of a CAA is at most linear in the size of
the query and quadratic in the size of the database:

Theorem 1. The size of the CAA for an evaluation problem $(Q, X_Q, \mathcal{D})$ is
$O(q \cdot n \cdot a)$.

The full version [26] of this paper describes situations where a better bound
$O(q \cdot n)$ can be obtained.
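
The numbers of Example 1 and the size measure above can be reproduced by brute force. The sketch below is not the algorithm of Section 6: it enumerates all answers first and then projects them onto slots and links, which by Definition 8 yields the CAA (every answer is an instantiation, and every candidate and link stems from some answer). The function names are hypothetical.

```python
from itertools import combinations
from math import comb

def chain_answers(n, q):
    """Answers of x1 ->+ x2 ->+ ... ->+ xq over the sequence d1 -> ... -> dn.

    In a sequence, d_i is a proper ancestor of d_j exactly when i < j, so
    the answers are the strictly increasing q-tuples of node indices.
    """
    return list(combinations(range(1, n + 1), q))

def aggregate(answers, q):
    """Project a set of answers onto slots (Dom) and links (Pi)."""
    dom = [set() for _ in range(q)]
    links = [set() for _ in range(q - 1)]
    for ans in answers:
        for i, d in enumerate(ans):
            dom[i].add(d)
        for i in range(q - 1):
            links[i].add((ans[i], ans[i + 1]))
    return dom, links

answers = chain_answers(8, 4)
dom, links = aggregate(answers, 4)
size = sum(len(s) for s in dom) + sum(len(l) for l in links)
print(len(answers), comb(8, 4))  # number of answers
print(size)                      # target candidates plus links
```

For $q = 4$ and $n = 8$ each slot holds five nodes and each pair of adjacent slots is connected by fifteen links, so the aggregate stays small although it represents all 70 answers.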

CAAs as semistructured data: According to Definition 6, the links of the CAA
for an evaluation problem $(Q, X_Q, \mathcal{D})$ are not labeled. If $x \to_f y$ is an $f$-child
constraint in $Q$, then the label $f$ can obviously be attached to each link in
$\Pi(x,y)$. If $x \to_1 y$ is a child constraint in $Q$, then each link of $\Pi(x,y)$ clearly
represents a labeled edge of $\mathcal{D}$, i.e. can be labeled like this database edge. Thus,
if all edge constraints of the query are child or $f$-child constraints, the CAA
trivially extends to a semistructured data item. If the query $Q$ contains
descendant constraints of the form $x \to_+ y$ or $x \to_\alpha y$ and if $\mathcal{D}$ is a sequence or
tree database, then for each $(d, e) \in \Pi(x,y)$ there exists in $\mathcal{D}$ a unique path $\pi$
from $d$ to $e$. The corresponding sequence of labels can be attached to the link
$(d, e)$ of the CAA, yielding a semistructured data item in this case, too. If $\mathcal{D}$ is
a proper DAG or graph database, a link $(d, e) \in \Pi(x,y)$ in general stands for a
regular expression. Although this is less immediate than in the previous cases of
sequence and tree databases, this expression can serve as a label of a CAA link.
For space reasons, details are left out.

CAAs for extended query formalisms: Many query languages for XML data
and semistructured data [11,6,23] allow for arbitrary value comparisons (joins),
i.e. database leaves are assumed to carry values, and queries may contain value
comparisons. A CAA does not suffice to represent exactly the answers
to a query with value comparisons. Nonetheless, the CAA for the join-free part
of the query can be built, representing a coarsening of the set of answers.
Note that such a generalization of a query might speed up its evaluation and be
appropriate for some applications such as e-commerce [18]. Extensions to CAAs
as defined here can be thought of for a faithful representation of the answers of
a query with value comparisons.

5 A Query Hierarchy 

We would like to clarify how the structural properties of both query and database 
affect the complexity of computing the CAA. To this end we introduce the 
following notions: 

Definition 9. Let $\mathcal{EV} = (Q, X_Q, \mathcal{D})$ be an evaluation problem. $\mathcal{EV}$ is of type
S-S (Sequence-Sequence) if $Q$ is a sequence query and $\mathcal{D}$ a sequence database.
Evaluation problems of types S-T, S-D, S-G, T-S, T-T, T-D, T-G, D-S, D-T,
D-D, D-G, and G-G are defined similarly: the first letter (S: sequence, T: tree,
D: DAG, and G: graph) denotes the type of the query, the second that of the
database.

The thirteen classes of evaluation problems of Definition 9 form a hierarchy of 
increasing structural complexity. This hierarchy does not include the types G-S, 
G-T, G-D, because they would correspond to evaluation problems with cyclic 
(i.e. type G) queries: According to Definition 5, evaluation problems with cyclic 
queries have no solutions in acyclic (i.e. type S, T, and D) databases. Note that 
all the evaluation problems of the types specified in Definition 9 are non-trivial 
in the sense that they might have solutions. 

Definition 10. A query is called simple if all its edge constraints are child,
$f$-child, or descendant constraints. An evaluation problem $(Q, X_Q, \mathcal{D})$ is simple if
$Q$ is simple.

According to Definitions 4 and 3, a query is simple if it does not involve regular 
expressions. For simple queries, the algorithms described in the next section have 
optimal complexity. 



6 Computation of CAAs 

In this section an algorithm for computing the CAA of a simple sequence query
is first given. Then, its adaptation to simple tree queries is outlined. For space
reasons, further adaptations, e.g. to tree queries involving regular path
expressions, are not explained in this paper. A polynomial algorithm for this case is
given in the full version [26].

Simple Sequence Queries

The algorithm depicted below takes a simple sequence query $Q$ and a database
$\mathcal{D}$ as arguments and computes the CAA $(Dom_Q, \Pi_Q)$ for the evaluation problem
induced by $Q$ and $\mathcal{D}$. Starting from an empty domain (line 3) and an empty set
of edges (line 4) the algorithm adds target candidates to the domains (slots) of
the query variables (line 15) and adds appropriate links to the set of edges (lines
30, 37). A pair $(x, d)$ is said to be “added” to express that target candidate $d$ is
added to the domain (slot) of $x$. Possibly, pairs $(x, d)$ are added that are “illegal”
in the sense that $d \notin Dom_Q(x)$.



     1  procedure Aggregate comp_agg(Q,db)
     2  begin
     3    Dom:=empty DomainSet;
     4    Π:=empty EdgeSet;
     5    q_l:=leaf of Q;
     6    for all nodes d in db do
     7      map(Dom,Π,q_l,d);
     8    Agg:=Aggregate(Dom,Π);
     9    clean(Agg);
    10    return Agg;
    11  end;
    12
    13  procedure boolean map(Dom,Π,x,dx)
    14  begin
    15    add dx to Dom(x);
    16    if dx satisfies the labeling
    17       constraints of x then
    18    begin
    19      if x=root then return true
    20      else
    21      begin
    22        y:=parent(x);
    23        Anc:=appr_ancestors(dx,x,y);
    24        map_found:=false;
    25        for all dy_i ∈ Anc do
    26          if dy_i ∉ Dom(y) then
    27            if map(Dom,Π,y,dy_i) then
    28            begin
    29              map_found:=true;
    30              add (y,x,dy_i,dx) to Π;
    31            end
    32          else
    33          begin
    34            if not is_red(Dom,y,dy_i) then
    35            begin
    36              map_found:=true;
    37              add (y,x,dy_i,dx) to Π;
    38            end;
    39          end;
    40        if map_found=false then
    41          color_red(Dom,x,dx);
    42        return map_found;
    43      end;
    44    else /*labeling constraints not satisfied*/
    45    begin
    46      color_red(Dom,x,dx);
    47      return false;
    48    end;
    49  end;







Illegal pairs are detected and marked “red” (lines 41, 46). Only “legal” links,
i.e., links in $\Pi_Q$, are introduced. As a last step, for each red pair $(x,d)$ node
$d$ is deleted from the slot of $x$. After this slot “cleaning” (line 9), the CAA
$(Dom_Q, \Pi_Q)$ is obtained.

Let $x_1, \dots, x_p$ be the ordered sequence of query variables such that
$x_i \to_? x_{i+1}$ ($1 \le i < p$) occurs in the simple sequence query. The algorithm starts with
the (unique) query leaf $x_p$ (line 5). The outermost loop (line 6) calls for each
database node $d_p$ the recursive function map. This function returns a boolean
value indicating whether the pair $(x_p, d_p)$ is legal. First, the pair is added (line
15). Assume the pair $(x_i, d_i)$ has been added. If $i = 1$, then the pair is legal (line
19). Otherwise $(x_i, d_i)$ is legal if and only if there exists a legal pair $(x_{i-1}, d_{i-1})$
such that $(d_{i-1}, d_i)$ satisfies the edge constraint $x_{i-1} \to_? x_i$ in $Q$. Hence, for each
“appropriate” ancestor $d_{i-1}$, which depends on the kind of the edge constraint,
it is checked whether $(x_{i-1}, d_{i-1})$ is a legal pair (lines 27, 34). Two cases are
distinguished:

1. If $(x_{i-1}, d_{i-1})$ has not been previously added, then the legality of $(x_{i-1}, d_{i-1})$
is checked by a recursive call of the function map (line 27).

2. Otherwise $(x_{i-1}, d_{i-1})$ is illegal if it is marked red (line 34).

In both cases, if $(x_{i-1}, d_{i-1})$ is legal then a link between $(x_{i-1}, d_{i-1})$ and $(x_i, d_i)$
is added (lines 30, 37). After inspection of all appropriate ancestors of $d_i$, the
pair $(x_i, d_i)$ is marked red if no legal pair $(x_{i-1}, d_{i-1})$ has been found (lines 40, 41).
Similarly, $(x_i, d_i)$ is marked red if $d_i$ does not satisfy all labeling constraints of
$x_i$ (line 46).
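
The procedure described above can be transcribed into a short runnable form. The following is a sketch under assumptions rather than the authors' implementation: the query is encoded as a root-to-leaf list of `(required_labels, kind)` pairs, only child and descendant constraints are handled, the database is given by a hypothetical `parents` map and a node-to-labels map, and red marks are stored as booleans inside each slot.

```python
def appr_ancestors(parents, d, kind):
    """Parents of d for a child constraint, all proper ancestors otherwise."""
    if kind == "child":
        return list(parents.get(d, []))
    seen, stack = [], list(parents.get(d, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.append(cur)
            stack.extend(parents.get(cur, []))
    return seen

def comp_agg(query, nodes, parents, node_labels):
    """Compute slots and links for a simple sequence query.

    query: list of (required_labels, kind) from root to leaf, where kind
           ("child" or "desc") links a variable to its predecessor and is
           None for the root.
    Returns (dom, links): dom[i] is the cleaned slot of variable i,
    links[i] the set of links between slot i and slot i + 1.
    """
    p = len(query)
    slots = [dict() for _ in range(p)]       # node -> True if marked red
    links = [set() for _ in range(p - 1)]

    def map_(i, d):
        slots[i][d] = False                  # add the pair (line 15)
        required, kind = query[i]
        if not required <= node_labels.get(d, set()):
            slots[i][d] = True               # labeling violated (line 46)
            return False
        if i == 0:
            return True                      # root variable (line 19)
        found = False
        for anc in appr_ancestors(parents, d, kind):
            if anc not in slots[i - 1]:
                legal = map_(i - 1, anc)     # recursive check (line 27)
            else:
                legal = not slots[i - 1][anc]  # already examined (line 34)
            if legal:
                found = True
                links[i - 1].add((anc, d))   # lines 30, 37
        if not found:
            slots[i][d] = True               # no legal ancestor (line 41)
        return found

    for d in nodes:                          # outermost loop (line 6)
        map_(p - 1, d)
    # Cleaning (line 9): drop every red target candidate.
    dom = [{d for d, red in slot.items() if not red} for slot in slots]
    return dom, links

# Example 1: four variables over the sequence 1 -> 2 -> ... -> 8.
query = [(set(), None)] + [(set(), "desc")] * 3
parents = {i: [i - 1] for i in range(2, 9)}
dom, links = comp_agg(query, range(1, 9), parents, {})
print([sorted(s) for s in dom])
print([len(l) for l in links])
```

The recursion depth is bounded by the length of the query, since every recursive call moves one variable closer to the root.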



Complexity: Clearly, for each pair $(x,d)$, map is called at most once. Hence,
the total number of calls to this function is bounded by the maximal number of
pairs $q \cdot n$. Under the assumption that each database node has links pointing to
parent nodes, computing the set of appropriate ancestors of a target candidate
$d_i$ takes time $O(a)$. Whether a node $d_i$ satisfies a given labeling constraint $A(x_i)$
can be checked in constant time. Since each labeling constraint refers to a unique
query variable, the total time needed for all tests related to labeling constraints
is $O(q \cdot n)$. Cleaning takes time $O(q \cdot n)$. Therefore, the overall complexity is
$O(q \cdot n \cdot a)$.



Simple Tree Queries 



In the case of simple tree queries we introduce the notion of an adapter point as
the bottom-most common query node of two query paths. The query paths are
processed consecutively. For each query path, the above algorithm is modified
as follows: When reaching an adapter point $x_{i-1}$ that has already been visited
during processing of another path, no new target candidate for slot $x_{i-1}$ is
introduced. Furthermore, only the already collected non-red target candidates
$(x_{i-1}, d_{i-1})$ are used for links between target candidates in $x_{i-1}$ and $x_i$. If, after
a query path has been fully processed, a target candidate $(x_{i-1}, d_{i-1})$ for an
adapter point $x_{i-1}$ has no links to a target candidate in $x_i$, then it is marked
red.

With simple tree queries, cleaning is more complicated. Call downwards
isolated those target candidates $(x_{i-1}, d_{i-1})$ such that for some child $x_i$ of $x_{i-1}$ there are
no links from $(x_{i-1}, d_{i-1})$ to a target candidate within the slot $x_i$. After
entering all target candidates, downwards isolated target candidates are detected.
Since they are illegal, they are marked red. Removal of red nodes may result
in new downwards isolated target candidates. The removal of red target
candidates is based upon a variant of (the second part of) the well-known AC-4
arc-consistency algorithm [29].
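
The propagation effect of cleaning can be illustrated with a naive fixpoint iteration. This sketch is deliberately not the counter-based AC-4 variant mentioned above, and it makes one assumption beyond the text: besides downwards isolated candidates, it also removes candidates left without any link from the parent slot. All names and the example aggregate are hypothetical.

```python
def clean(dom, links, children):
    """Remove isolated target candidates until nothing changes.

    dom:      variable -> set of nodes
    links:    (x, y) -> set of (d, e) pairs, one entry per query edge
    children: variable -> list of child variables in the query tree
    """
    parent = {y: x for x, ys in children.items() for y in ys}
    changed = True
    while changed:
        changed = False
        for x in dom:
            for d in list(dom[x]):
                # d needs a link into every child slot ...
                down_ok = all(any(a == d for (a, _) in links[(x, y)])
                              for y in children.get(x, []))
                # ... and, unless x is the root, a link from the parent slot.
                up_ok = (x not in parent or
                         any(b == d for (_, b) in links[(parent[x], x)]))
                if down_ok and up_ok:
                    continue
                dom[x].discard(d)
                for y in children.get(x, []):
                    links[(x, y)] = {l for l in links[(x, y)] if l[0] != d}
                if x in parent:
                    key = (parent[x], x)
                    links[key] = {l for l in links[key] if l[1] != d}
                changed = True
    return dom, links

dom = {"x": {1, 2}, "y": {10, 11}, "z": {20}}
links = {("x", "y"): {(1, 10), (2, 11)}, ("x", "z"): {(1, 20)}}
children = {"x": ["y", "z"]}
clean(dom, links, children)
print(dom)
print(links)
```

Here candidate 2 of slot x has no link into slot z and is removed; this in turn leaves candidate 11 of slot y without support, so it is removed in the same pass. An AC-4 style implementation would reach the same fixpoint with support counters instead of repeated scans.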



Complexity: Since the recursive calls do not fill slots of adapter points twice,
no target candidate is processed twice in the first part of the algorithm in this
case either. This gives a time complexity of $O(q \cdot n \cdot a)$ for this part. For cleaning,
an adaptation of arguments from [29] yields time complexity $O(q \cdot n \cdot a)$. Space
complexity is still $O(q \cdot n \cdot a)$.

Adding Regular Path Expressions

A simple modification suffices to adapt the algorithms for sequence and tree
queries described above to queries involving regular path expressions. At line 23, if
the query contains an $\alpha$-descendant constraint $y \to_\alpha x$, then the set of
appropriate ancestors of the current node dx is now the set of all $\alpha$-ancestors of the
database node dx.

The computation of the sets of a-ancestors needs some extra time. Using 
standard techniques from automata and graph theory, it is shown in the full 
version [26] that the resulting time complexity is 0{q^ ■ n ■ a). 



Simple DAG and Graph Queries 

Four evaluation problems of the hierarchy turn out to be NP-complete with 
respect to combined complexity. 

Define the weight of a node (resp. variable) of a database (resp. query) as 1
plus the number of labels attached to the node (resp. variable). Define the size of
a database (resp. query) as the sum of the weights of its nodes (resp. variables)
and the number of its edges. Define the size of an evaluation problem $(Q, X_Q, \mathcal{D})$
as the sum of the size of $Q$ and the size of $\mathcal{D}$. It is shown in the full version [26]
that so-called 1-in-3 problems over positive literals [13] can be encoded as D-T
evaluation problems, using a polynomial translation. The following theorem is a
simple consequence.

Theorem 2. Whether a simple D-T (resp. D-D, D-G, G-G) evaluation problem
$(Q, X_Q, \mathcal{D})$ has a solution is NP-complete with respect to the size of $(Q, X_Q, \mathcal{D})$.



Table 1. Complexity for computing the CAA resp. (*) for deciding solvability 





Database 

S T D G 


Queries S 

without T 

reg. path D 

expressions G 


0{q-n-a) 0{q-n-a) 0{q-n-a) 0{q-n-a) 

0{q-n-a) 0{q-n-a) 0{q-n-a) 0{q-n-a) 

0(q ■ e ■ ■ a) {*) NP-compl. (*) NP-compl. (*) NP-compl. 

(*) NP-compl. 


Queries with S 
reg. path expr. T 


0{q"‘ ■ n - a) 0{q^ ■ n ■ a) 0{q‘‘ ■ n ■ a) 0{q‘‘ ■ n ■ a) 

0{q^ ■ n ■ a) 0{q^ ■ n ■ a) 0{q^ ■ n ■ a) 0{q^ ■ n ■ a) 



Table 1 summarizes the complexity results for the computation of CAAs (resp. 
for deciding solvability, marked with *) given in this section. The case D-S in 
the upper table is established in the full version [26]. The parameter e denotes 
the number of edges in the query. 

Due to the results on the size of CAAs (cf. Section 4), the bounds given at
lines S and T of the upper table are optimal.

The worst-case time complexity for computing a CAA can be
exponential for D-T, D-D, D-G and G-G evaluation problems. This can be
countered by a polynomial-time computation of an “upper approximation” to the
CAA, i.e. an answer aggregate yielding all answers, but possibly also
containing target candidates or links that do not contribute to any answer. Such an
upper approximation to a CAA can be obtained by first selecting a spanning
tree $T$ of the considered query and then computing the CAA for the subquery induced
by this spanning tree. Links representing possible interpretations of the query
edges that have been omitted from $Q$ might then be added. Arc consistency
techniques [29] can be used to erase nodes and edges that do not contribute to
any instantiation.

7 Advanced Query Answering Using CAAs 

Once the CAA for an evaluation problem has been computed, it can be exploited
for advanced query answering techniques that we call answer searching and answer
browsing. These notions are explained referring to the example of Section 2: A
query $Q$ to a research project database $\mathcal{D}$ retrieves projects $x$ and the managers
$z$ of these projects such that the string “XML” occurs in a title element $t$ (at
any depth) of some article $y$ of a project $x$.

Answer Searching: In essence, the CAA of a query is a data structure making
explicit the interdependencies between the answers to a query. Comparing queries
and investigating the interrelationships between the various answers to a query
is needed in many applications. CAAs can be used for scanning, comparing,
filtering, ordering in the style of search engines as well as for analyzing in any other
manner the answers to a query. Particularly promising are search primitives for
detecting commonalities and differences between answers, and for the computation of
aggregate values like averages, maxima and minima. Note that the nodes of database
items stored in a CAA potentially give access to the subelements rooted at these
nodes, thus giving rise to a semantically rich “answer searching”.

The set of answers for the above-mentioned query $Q$ can be searched for:

• managers leading the highest number of projects,
• managers leading at least 2 projects,
• projects with at least 10 XML articles.

To answer such queries, aggregate values have to be computed from the CAA
of query $Q$. Note that such semantically related aggregate values are often
computed in the same query. Many database applications like molecular biology
sequence analysis and e-commerce require such advanced comparisons
on large answer sets computed from some “base query” [18].
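
Such aggregate values can be read off the links of a CAA without enumerating answer tuples. The sketch below assumes that the links between the project slot $x$ and the manager slot $z$ are available as a set of pairs; the function name and the sample links are hypothetical.

```python
from collections import defaultdict

def projects_per_manager(links_xz):
    """Count, for each manager in slot z, the linked projects in slot x."""
    by_manager = defaultdict(set)
    for project, manager in links_xz:
        by_manager[manager].add(project)
    return {m: len(ps) for m, ps in by_manager.items()}

# Hypothetical links of a CAA between projects and their managers.
links_xz = {("p1", "Mary"), ("p2", "John"), ("p3", "Mary"), ("p4", "Mary")}
counts = projects_per_manager(links_xz)

at_least_two = sorted(m for m, c in counts.items() if c >= 2)
busiest = max(counts, key=counts.get)
print(counts)
print(at_least_two, busiest)
```

The cost of such a search depends on the number of links in the aggregate, not on the number of answer tuples it represents.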

In many cases, the querying of the CAA of a base query can be carried out
automatically. In some cases, however, such an automatic “search” is not sufficient
or not possible, and an interactive “browsing of the answer space” is desirable.

Answer Browsing: A visualization of a CAA for a query can be a convenient
basis for browsing the answers to that query. One can easily identify nodes of
the CAA of query $Q$, like project 5, that deserve special attention by looking
at the number of departing article links of CAA 1 in Figure 2. More elaborate
visualization facilities would make it possible to directly browse between e.g.
projects, articles, titles by following the links of the CAA.

If the CAA has a large number of nodes, then the user might get “lost in
answer space”. In such cases, it might be beneficial to restrict the visualization
to a view of the CAA. In case of query $Q$, one might wish to restrict, say, the
CAA to information associated with projects with at least three XML related
articles. Also, slot hiding can provide a better overview. Note that slot
hiding corresponds to the projection operator of relational databases. If slot
hiding is applied, then CAA links are inherited. Applying slot hiding to the
running example yields CAA 2 depicted in Figure 2.
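
One possible reading of link inheritance under slot hiding is relational composition: when a slot $y$ between $x$ and $t$ is hidden, $x$ and $t$ become linked whenever they were connected through some candidate of $y$. The sketch below implements this reading only; the interpretation, the function name and the sample links are assumptions.

```python
def hide_slot(links_xy, links_yt):
    """Compose the links x-y and y-t into inherited links x-t."""
    by_mid = {}
    for e, g in links_yt:
        by_mid.setdefault(e, set()).add(g)
    return {(d, g) for d, e in links_xy for g in by_mid.get(e, ())}

# Hypothetical links: projects to articles, articles to titles.
links_xy = {("p1", "a1"), ("p2", "a1"), ("p2", "a2")}
links_yt = {("a1", "t1"), ("a2", "t2")}
print(sorted(hide_slot(links_xy, links_yt)))
```

As with relational projection, the hidden slot can no longer be inspected afterwards, so distinct articles leading to the same title collapse into a single inherited link.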



[Figure 2 (graphic not reproduced). Recoverable panel labels: x (projects):
6 answers; z (manager): 2 answers; y (article): 12 answers.]



Fig. 2. Three presentation forms of CAAs 



A further visualization technique based upon CAAs is clustered aggregation.
Assuming that nodes have further attributes, CAA nodes that have identical
attribute values can be merged. In case of query $Q$, if projects have a “country”
attribute, if projects $p_1$, $p_3$, $p_4$ are French, projects $p_2$ and $p_6$ are German, and
if $p_5$ is a US project, then applying clustered aggregation might yield CAA 3 of
Figure 2. This presentation might be used to show which countries have projects
of interest and how many.
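
Clustered aggregation on the country attribute can be sketched as a grouping step over one slot, followed by a rewrite of the links that leave it. The country assignment below is the one given in the text; the function names and the sample links are hypothetical.

```python
from collections import defaultdict

def cluster_slot(slot, attribute):
    """Group the target candidates of a slot by an attribute value."""
    groups = defaultdict(set)
    for node in slot:
        groups[attribute[node]].add(node)
    return dict(groups)

def cluster_links(links, attribute):
    """Redirect links so that they start at clusters instead of nodes."""
    return {(attribute[d], e) for d, e in links}

country = {"p1": "France", "p3": "France", "p4": "France",
           "p2": "Germany", "p6": "Germany", "p5": "US"}
clusters = cluster_slot(country.keys(), country)
print({c: len(ps) for c, ps in clusters.items()})

# Hypothetical links from projects to managers.
links_xz = {("p1", "Mary"), ("p3", "Mary"), ("p2", "John")}
print(sorted(cluster_links(links_xz, country)))
```

Merging identical attribute values shrinks the slot from six projects to three countries while keeping, per cluster, the set of merged nodes for later inspection.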

General picture: cascade style query-answering: The possibilities that CAAs
offer for automated search as well as interactive browsing of answer sets suggest
that, for many applications, query answering can be processed in two or more
successive phases: the first phase results in the construction and storing
of the CAA, and the following phases consist of an inspection of this CAA or of
the construction of further, more specialized CAAs. CAA inspection can consist
of an automated search or of interactive browsing. In addition to helping
analyze and/or browse answers, such a query answering based upon CAAs
can help to react rapidly to user query requests.

Cascade style query-answering is indeed possible with any query
language. The contribution of CAAs lies in an intermediate data structure
that supports this form of query answering and can be efficiently computed.
Note that for many novel applications such as e-commerce such a cascade style
query-answering is needed [18].



8 Related Work 

Tree Databases: For tree queries and tree databases, a simplified form of CAA
based on the Tree Matching formalism [20] has already been introduced in [25,
24]. The present paper significantly extends this earlier work.

Query formalisms for semi-structured data: Several query models for XML
and semistructured data have been designed and/or implemented and used [23,
3,33,15,30,31,4,10], cf. [4,10] for surveys. These query models are more ambitious
than the model presented in this paper. However, in contrast to this paper, they
are not devoted to aggregating answers. Thus, the contribution of this paper is
largely orthogonal and complementary.

Conjunctive queries: The queries considered in this paper are a special case of 
the conjunctive queries over relational databases as investigated in, e.g., [7,16]. 
However, it must be stressed that the distinction between tree queries and DAG 
or graph queries does not correspond to the conventional distinction between 
acyclic and cyclic conjunctive queries in database theory. 

The distinction of database theory between acyclic and cyclic conjunctive
queries refers to the hypergraph of the query, which is an undirected graph.
In contrast, the query atoms considered in this paper are unary (labeling
constraints) or binary (edge constraints). A binary atom $r(x,y)$ imposes a fixed
orientation $x \to y$ on $\{x, y\}$ which reflects the direction of edges in the database.
The conjunctions considered in this paper are such that the set of their binary
atoms induces a sequence, tree, DAG, or graph structure on the query. Hence,
the NP-hardness result for DAG queries given in Section 6 is not in conflict with
general results on polynomial tractability of acyclic conjunctive queries.

Dynamic programming and constraint reasoning: The algorithms described
in Section 6 are closely related to methods of dynamic programming and to
the arc-consistency techniques for constraint networks, cf. the full version [26]
for details.

Index structures: Index structures as discussed in [23,2,14,28,27] can be used
to improve the practical efficiency of the computation of CAAs. Details can be
found in the full version [26].

9 Conclusion 

The paper motivated and introduced “complete answer aggregates (CAAs)” as
a model for aggregating the answers to a sequence, tree, DAG, or graph query
in a semistructured database. Algorithms for the computation of CAAs for
queries of various structural kinds have been presented. A hierarchy of evaluation
problems the CAAs of which can be computed in polynomial time (with respect
to combined complexity) has been given. Cases have been characterized where
computation of CAAs (emptiness problem) is NP-complete.

The query model presented in this paper has been implemented for the special
case of tree queries and of tree databases. The implementation is being tested
with large collections of complex structured documents. The prototype currently
available does not handle regular path expressions, but can cope with left-to-right
order constraints between the children of a query node, as needed in document
management. The necessary adaptation of the notion of a CAA as well as the
mathematical and algorithmic background is given in [25,24]. For this expanded
signature, the time complexity of the algorithm for computing the CAA is
$O(q \cdot n \cdot a \cdot \log(n))$. The additional logarithmic factor comes from the fact that order
information is now being taken into account and handled during query evaluation.
An “answer browser” based on CAAs is currently being developed.

Abstract. We study the problem of pre-computing auxiliary information
to support on-line range queries for the sum and max functions on a
datacube. For a d-dimensional datacube with size n in each dimension, we 
propose a data structure for range max queries with $O((4L)^d)$ query time
and $O((12L^2 n^{1/L}\gamma(n))^d)$ update time where $L \in \{1, \ldots, \log n\}$ is a
user-controlled parameter and $\gamma(n)$ is a slow-growing function. (For example,
$\gamma(n) \le \log^* n$ and $\gamma(2^{4110}) = 3$.) The data structure uses $O((6n\gamma(n))^d)$
storage and can be initialized in time linear to its size. There are three 
major techniques employed in designing the data structure, namely, a 
technique for trading query and update times, a technique for trading 
query time and storage and a technique for extending 1 -dimensional data 
structures to d-dimensional ones. Our techniques are also applicable to 
range queries over any semi-group and group operation, such as min, 
sum and count. 



1 Introduction 

Recently, research in On-Line Analytical Processing (OLAP) [14] has attracted 
a lot of attention. A popular data model for OLAP applications is the data 
cube [19] or the multi-dimensional databases [15,1]. In this model, an aggregate 
database with d functional attributes and one measure attribute is viewed as a d- 
dimensional array. Each dimension corresponds to a functional attribute and the 
value of an array entry corresponds to the measure attribute. One of the main 
research focuses is concerned with the orthogonal range query problems, i.e., the 
pre-computation of auxiliary information (data structures) to support on-line 
queries of various functions such as SUM, COUNT, AVERAGE, MAX and MIN 
over values lying within an orthogonal region, see [22,21,23,18,7]. These queries 
provide useful information for companies to analyze the aggregate databases 
built from their data warehouses. 

A generic orthogonal range query problem can be stated as follows. Given an 
aggregate database T with N records, each consisting of a key field corresponding 
to a point in a d-dimensional space and a value field, preprocess T such that 
subsequent queries $f(R_1, R_2, \ldots, R_d)$ can be answered efficiently. Here f is a
function definable over a variable number of values, such as MAX, MIN, SUM,
COUNT, ENUMERATE, etc.; and the query $f(R_1, R_2, \ldots, R_d)$ asks for the value

* This research was fully supported by a grant from the Research Grants Council of 
the Hong Kong SAR, China [Project No. 9040314 (RGC Ref. No. CityU 1159/97E)]. 



J. Van den Bussche and V. Vianu (Eds.): ICDT 2001, LNCS 1973, pp. 361—374, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







C.K. Poon 



when applying f on all the values lying within the orthogonal region
$R_1 \times R_2 \times \cdots \times R_d$, where $R_i$ specifies an interval in the i-th dimension for
$1 \le i \le d$. For example, the query MAX($R_1, R_2, \ldots, R_d$) asks for the maximum among
all values lying within the orthogonal region $R_1 \times R_2 \times \cdots \times R_d$. The query
ENUMERATE($R_1, R_2, \ldots, R_d$) asks for all the values present in the range.

Note that the query regions are unknown before the preprocessing and the 
data structure should be capable of handling all possible query regions. One 
solution for the problem is to store nothing except the original database. Then 
a query may take $O(N)$ time in the worst case. We call this the lazy approach.
Another solution, which we call the workaholic approach, is to pre-compute the 
answers for all possible query regions. Then query takes constant time but there 
are 0{N'^) pre-computed answers to be stored. As N is typically very large in 
OLAP applications, both solutions are unsatisfactory. Therefore, the crux of 
the problem is to design a data structure with close to constant query time and 
nearly linear storage simultaneously. Added to the difficulty of the problem is the 
fact that the database T may change over time. Therefore, another performance 
measure for the data structure is its update time. 



1.1 Previous Results in Computational Geometry 

There is a rich body of research on orthogonal range queries when the data 
points are sparse. Under this data distribution, merely storing the points in a
space-efficient manner while allowing users to quickly locate the points within 
an orthogonal region (so that they can subsequently be enumerated at constant 
time per point) is highly non-trivial. Therefore, the range enumeration problem 
is of central importance and has been extensively studied, see [3,16,24,6,5,4,8,9]. 
As it turns out, many of the ideas used in these data structures can be adapted 
for other range queries. 

For range sum and count problems, a classical data structure is the ECDF
tree of Bentley [4] obtained by applying his multi-dimensional divide-and-conquer
technique. It requires $O(N \log^{d-1} N)$ storage and has $O(\log^d N)$ query time.

Using the idea of downpointers, Willard [27] improved the query time by a factor 
of log N. Chazelle [9] further improved the storage by a factor of (roughly) log N. 
For example, for any $d \ge 2$, and any small constant $\epsilon > 0$, he exhibited data
structures with $O(N \log^{d-2} N)$ storage and $O(\log^{d-1} N)$ query time for range
count queries, and $O(N \log^{d-2+\epsilon} N)$ storage and $O(\log^{d-1} N)$ query time for
range max queries. He also observed that 1-dimensional range max queries can
be answered in constant time and $O(N)$ space by combining the Cartesian tree of
Vuillemin [26] and the nearest common ancestor algorithm of Harel and Tarjan
[20].

To handle updates as well, the best known data structures typically require
$O(\log^d N)$ update time, see Willard [27], Willard and Lueker [28]. Also, allowing
for updates often incurs a slowdown in the query time. For example, Willard
and Lueker [28] devised a transformation that adds range restriction capabilities
to dynamic data structures by increasing query time by a factor of $O(\log N)$,
provided the aggregate function f satisfies certain decomposability conditions.
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If fast update time is imperative in an application, then the structure by Ravi 
Kanth and Ambuj Singh [25], which has 0{logN) update time and 0{N'^) query 
time, may be an alternative. 

Various lower bounds suggest that these results are close to optimal, see [10,11].
In particular, it was proved in [11] that $\Omega((\log N / \log(2S/N))^{d-1})$ is a lower
bound on the query time for range sum and max queries when the data structure
uses $O(S)$ storage and is oblivious to the values of the data points. (In fact, the
result applies to any semigroup operation possessing the so-called faithfulness
property, which is enjoyed by most semigroups.) For the dynamic range sum
and count problems, Fredman [17] proved that $\Omega(N (\log N)^d)$ time is necessary
for performing a sequence of N operations containing insertions, deletions and
queries.



1.2 New Perspective in OLAP Environment 

Ho et al. [22] pointed out that the non-linear storage requirement may pose a
problem when applying the above data structures in an OLAP application. In
particular, the $O(\log^{d-1} N)$ factor can be devastating in an OLAP application
which has, say, d = 10 dimensions and N = 10® records. Given the lower bounds 
mentioned before, it seems difficult, if not impossible, to build data structures to 
support efficient OLAP queries. On the other hand, it is also observed that data 
points often form clusters in many applications, see [22] and [13] for example. 
If the data set is sufficiently dense or clusters of dense data points can be
found readily [30], it is reasonable to consider orthogonal range queries in the
following situation which we call the dense settings. The data points are stored in
a multidimensional array of size n in each dimension and there are $N = n^d$ data
points in the array. (In contrast, N is much less than $n^d$ in the sparse settings
discussed in the previous subsection.) The index of the array is assumed to be
integral. (If the original dimension is non-integral, we can work with the rank 
space of that dimension.) 

Under the dense settings, Ho et al. [22] proposed the prefix sum data structure
that achieves $O(2^d)$ query time for range sum queries while using only $O(N)$
extra storage. The update cost is, however, $O(N)$ in the worst case. Geffner et
al. [18] designed an extension of the data structure, called the relative prefix sum,
which requires only $O(\sqrt{N})$ update cost. Chan and Ioannidis [7] further studied
the tradeoff between the query and update costs. They proposed the hierarchical
rectangle cube and the hierarchical band cube, both of which are experimentally
shown to outperform the relative prefix sum structure. Fredman [17] gave data
structures that support prefix sum queries and updates in $O(\log^d N)$ time using
$O(N \log^d N)$ storage.
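
As an illustration of the prefix sum idea attributed to Ho et al. [22] above, here is a minimal Python sketch for d = 2. The function names and the extra zero row and column are choices made for this example and are not taken from [22]. The sketch only aims to show that one range sum costs $2^d = 4$ look-ups, while a change to a single entry may force many stored sums to be rewritten.

```python
def build_prefix_sums(a):
    """Return p with p[i][j] = sum of a[x][y] over x < i, y < j.

    The extra zero row and column (an assumption of this sketch) avoid
    special cases at the array border.
    """
    rows, cols = len(a), len(a[0])
    p = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            p[i + 1][j + 1] = a[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
    return p


def range_sum(p, r1, r2, c1, c2):
    """Sum over rows r1..r2 and columns c1..c2 (inclusive) with 4 look-ups."""
    return p[r2 + 1][c2 + 1] - p[r1][c2 + 1] - p[r2 + 1][c1] + p[r1][c1]
```

The subtraction relies on SUM being a group operation; the same inclusion-exclusion trick is not available for MAX, which is why range max queries need the different machinery developed later in the paper.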

For range maximum queries, [22] studied a quad-tree-like structure which takes
$O(N)$ storage. In the worst case, it answers queries in $O(\log N)$ time in the
1-dimensional case and $\Omega(N^{1-1/d})$ time in the d-dimensional case. Ho et al. [21]
then investigated various techniques to improve the average query time. Chazelle
and Rosenberg [12] designed a much more efficient data structure which has
$O(\alpha^d(s, n))$ query time when $O(s^d)$ storage is allowed, where $\alpha(s, n)$, the
functional inverse of Ackermann’s function, is an extremely slow-growing function.
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1.3 Our Contributions 

In this paper, we design several data structures for range queries by exploit- 
ing the properties in the dense settings. Our data structures are array-based 
rather than linked-list based. Both the query and update algorithms involve in- 
dex calculations, array accesses and applications of the queried operator (e.g., 
MAX, SUM). Number of operator applications is bounded above by the number 
of array accesses. Time for index calculation is negligible compared with array 
accesses when d is large in our construction. Moreover, if arrays are stored in 
secondary storage, cost of array accesses dominates that of CPU computations. 
Therefore, our formula only accounts for the number of array accesses. 

First, we propose a data structure for 1-dimensional prefix max queries (a
special case of range queries). For a 1-dimensional array of size n, the data
structure has $O(L)$ query time, $O(Ln^{1/L})$ update time and requires $O(n)$ storage,
where $L \in \{1, \ldots, \log n\}$ is a user-controlled parameter. When L = 1, the query
time is the fastest but update is slow. As L increases, query time increases
while update time decreases. When L = log n, both query and update require
$O(\log n)$ time. Our technique is applicable to any semi-group or group operation.
In particular, we have a data structure for 1-dimensional prefix sum queries
having the same performance. By an observation in [2,22], the structure also
answers 1-dimensional range sum queries in $O(L)$ time.

Second, we propose a data structure for 1-dimensional range max queries
which has $O(L)$ query time, $O(L^2 n^{1/L}\gamma(n))$ update time and $O(n\gamma(n))$ storage
where $\gamma(n)$ is a slow-growing function, e.g., $\gamma(n) \le \log^* n$. Putting L = 1, our
structure requires at most 4 array look-ups. This is the fastest query time among
all known data structures with such a small asymptotic storage complexity.
The previously best result, by [12], requires 7 look-ups. As we will see next, the
constant is important when we extend the data structure for higher dimensional
queries. Our construction borrows a lot from the recursive technique of [12].
However, we have a better base case construction and this brings about a
constant factor improvement in the query time for the same asymptotic storage
complexity.

Third, we define a class of data structures called oblivious storage sche- 
mes and propose a technique to extend such a data structure for 1-dimensional 
range queries to multi-dimensional range queries. Our technique generalizes 
the one used in [12] by taking care of updates as well. Applying it to our
1-dimensional range max structure (which is an oblivious storage scheme), we
obtain a d-dimensional range max structure which has $O((4L)^d)$ query time,
$O((12L^2 n^{1/L}\gamma(n))^d)$ update time and $O((6n\gamma(n))^d)$ storage, where
$L \in \{1, \ldots, \log n\}$. Similarly, we obtain a d-dimensional range sum structure which has
$O((2L)^d)$ query time, $O((2Ln^{1/L})^d)$ update time and $O((2n)^d)$ storage. This
generalizes the results of [22,18,17].

The rest of the paper is organized as follows. Sections 2 and 3 contain data
structures for 1-dimensional prefix and range max queries respectively. In Section 
4, we discuss the concept of oblivious storage scheme and its application to 
generalizing 1-dimensional structures to higher dimensions. Finally, Section 5 
contains some open problems. 
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2 One-Dimensional Prefix Max Queries 

A prefix max query is a query of the form, MAX(0, i), i.e., the lower end-point of 
the range is fixed at index 0. Our construction, to be described shortly, is applica- 
ble to any commutative semigroup or group operator, including min, sum, count, 
etc. In particular, it results in a data structure for 1-dimensional prefix sum queries
having the same performance. As observed in [2,22], SUM(i, j) = SUM(0, j)
- SUM(0, i - 1). Therefore, this structure can also answer 1-dimensional range
sum queries in twice the amount of time.

Without loss of generality, we assume the size of array A is $n = b^L$ for some
integers $b > 1$ and $L \ge 1$. To explain the data structure, consider an (L + 1)-level
complete b-ary tree. We assign the n entries of A to the $b^L = n$ leaves of
this tree. Next, we assign to each internal node the maximum value among the
leaves of its subtree. Then to each node (leaf or internal), we compute and store
a ‘prefix-max’ value which is equal to the maximum of the ‘assigned’ values of its
left siblings and itself. To facilitate the discussion, we label the tree with array
indices as follows. First, we number the levels from 0 at the leaves up to L at the
root. The leaves are then labelled from 0 to n - 1 starting from the left. Internal
nodes at level 1 are labelled from 0 to (n/b) - 1, and in general nodes at level w
are labelled from 0 to $(n/b^w) - 1$. See Figure 1 for an illustration when b = 3,
L = 3.



Fig. 1. A tree structure for b = 3, L = 3 [drawing not reproduced]



Initialization and Storage: For every integer w = 1 to L, we define $A_w$ as an
array with $n/b^w$ entries, indexed from 0 to $n/b^w - 1$, so that

$$A_w[i] = \max\{A[ib^w + j] \mid 0 \le j < b^w\} = \max\{A_{w-1}[ib + j] \mid 0 \le j < b\}.$$

Referring to Figure 1, $A_w[i]$ contains the ‘assigned’ value at node i in level w.
We do not really store these L arrays. Instead, we store the prefix max of these
arrays and the base array A. (For uniformity, we let $A_0 = A$.) More precisely,
for every integer w = 0 to L - 1, we compute and store the array $P_w[0..n/b^w - 1]$
so that

$$P_w[i] = \max\{A_w[j] \mid b\lfloor i/b \rfloor \le j \le i\}.$$

Obviously, computing all the $P_w$'s requires $O(n)$ time and storing them requires

$$\sum_{w=0}^{L-1} \frac{n}{b^w} \le 2n$$

storage for $b \ge 2$.

Query: To answer the query MAX(0, n - 1), simply return $P_{L-1}[b - 1]$. For a
query MAX(0, i), where $i \le n - 2$, we convert i + 1 (the length of the interval
[0..i]) to a base-b number. The w-th digit will tell us which entry in $P_w$ is needed.
Specifically, let the base-b representation of i + 1 be $I_{L-1} I_{L-2} \cdots I_0$ where $I_w$
is the w-th digit. (Since $i + 1 \le n - 1 = b^L - 1$, there are at most L non-zero
digits.) Then we calculate

$$I'_{L-1} = I_{L-1} - 1$$

$$I'_{L-2} = b(I'_{L-1} + 1) + I_{L-2} - 1$$

$$\vdots$$

$$I'_1 = b(I'_2 + 1) + I_1 - 1$$

$$I'_0 = b(I'_1 + 1) + I_0 - 1$$

and MAX(0, i) = $\max\{P_w[I'_w] \mid 0 \le w < L, I_w \ne 0\}$. Therefore, a query takes at
most L array look-ups (and $O(L)$ time for index calculation).

Update: When an update is made in the base array A, at most b entries in each
of the prefix max arrays need to be changed. In particular, if A[i] is updated,
then for each w = 0 to L, $A_w[i_w]$, where $i_w = \lfloor i/b^w \rfloor$, may also be changed.
Therefore, $P_w[i_w..\lfloor i_w/b \rfloor b + b - 1]$ (or at most b entries in $P_w$) has to be changed
and no other changes are needed.

Now, we show how to update each $P_w$ in $O(b)$ time. Suppose we have
updated the $P_{w'}$'s for all w' < w, and $P_w[j']$ for all j' < j. To update $P_w[j]$, consider
node j in level w of the tree. We first initialize it to the maximum of its subtree,
i.e., set $P_w[j] = P_{w-1}[bj + b - 1]$. (If w = 0, then set $P_w[j] = A[j]$ instead.) Next,
if the node does not have a left sibling, i.e., j mod b = 0, then we are done with
$P_w[j]$. Otherwise, set $P_w[j]$ to the maximum between $P_w[j - 1]$ and $P_w[j]$. The
total update cost is at most $2Ln^{1/L}$ array reads/writes (and $O(Ln^{1/L})$ time for
index computation).
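
The following Python sketch shows one way the prefix max structure of this section could be realised. It is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, n is assumed to be a power of b, the arrays $P_0, \ldots, P_{L-1}$ are kept as a list of lists, and the index $I'_w$ is computed directly as $\lfloor (i+1)/b^w \rfloor - 1$ rather than through the digit recurrence (the two agree, since $I'_w + 1$ counts the complete level-w nodes inside [0..i]).

```python
def build_prefix_max(a, b):
    """Return [P_0, ..., P_{L-1}] for an array a whose length is b**L."""
    levels = []
    cur = list(a)                      # A_w, starting with A_0 = A
    while len(cur) > 1:
        p = []
        for j, v in enumerate(cur):    # prefix max inside each group of b siblings
            p.append(v if j % b == 0 else max(p[-1], v))
        levels.append(p)
        # A_{w+1}[i] is the last prefix max of sibling group i
        cur = [p[g + b - 1] for g in range(0, len(cur), b)]
    return levels


def prefix_max(levels, b, i):
    """MAX(0, i) using at most one look-up per level."""
    count = i + 1
    best = None
    for w, p in enumerate(levels):
        q = count // b ** w            # complete level-w nodes inside [0..i]
        if q % b != 0:                 # the w-th base-b digit of i + 1 is non-zero
            best = p[q - 1] if best is None else max(best, p[q - 1])
    # All digits are zero only when the query covers the whole array.
    return levels[-1][-1] if best is None else best


def update_entry(a, levels, b, i, value):
    """Set a[i] = value and repair at most b entries in every P_w."""
    a[i] = value
    for w, p in enumerate(levels):
        j0 = i // b ** w
        group_end = (j0 // b) * b + b
        for j in range(j0, group_end):
            v = a[j] if w == 0 else levels[w - 1][b * j + b - 1]
            p[j] = v if j % b == 0 else max(p[j - 1], v)
```

With b = n (so L = 1) a query is a single look-up and an update may touch the whole array; with b = 2 both operations take about log n steps, which is the tradeoff described in the text.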

3 One Dimensional Range Max Queries 

We first present a simple data structure called the Bi-directional Prefix Max 
Structure (BPM) which is the key to our data structures. Putting suitable para- 
meters to our BPM structure, we obtain a data structure which answers range 




Orthogonal Range Queries in OLAP 






max queries in 2 array lookups but requires a logarithmic blow-up in storage. We 
then reduce the blow-up factor by using a recursion technique in [12]. Through- 
out this section, we often refer to sub-intervals with boundaries located at certain 
positions. For convenience, we call an interval [i, j] an a-interval if i is a multiple
of a and i + a = j + 1. We define $i(a) = \lfloor i/a \rfloor a$, i.e., the largest multiple of a less
than or equal to i.



3.1 The Bi-directional Prefix Max Structure

We define the (l, h)-bi-directional prefix max structure for an array A[0..n - 1],
denoted BPM(A, l, h), as a collection of 2h arrays, $PR_0, PR_1, \ldots, PR_{h-1}$, and
$PL_0, PL_1, \ldots, PL_{h-1}$, each of length n. For $0 \le w < h$, the content of arrays
$PL_w$ and $PR_w$ is as follows:

$$PL_w[i] = \max\{A[j] \mid i(2^w l) \le j \le i\}$$

$$PR_w[i] = \max\{A[j] \mid i \le j \le i(2^w l) + 2^w l - 1\}$$

This requires 2nh storage cells and can be initialized in $O(nh)$ time. Given
a query MAX(i, j) where $\lfloor i/(2^w l) \rfloor + 1 = \lfloor j/(2^w l) \rfloor$ for some integer
$w \in \{0, \ldots, h - 1\}$, the range [i, j] spans across two adjacent $2^w l$-intervals. Therefore,
MAX(i, j) = $\max\{PR_w[i], PL_w[j]\}$. We will describe the calculation of w
later when we have specific values for the parameters l and h.

For an update to A[i], we need to modify at most 2l elements in $PL_0$ and
$PR_0$, namely, $PL_0[i..i(l) + l - 1]$ and $PR_0[i(l)..i]$. Similarly, we need to change at
most 4l elements in $PL_1$ and $PR_1$, ..., and $2^h l$ elements in $PL_{h-1}$ and $PR_{h-1}$.
Therefore, updating BPM(A, l, h) requires at most $O(2^h l)$ array accesses. We
can reduce the update time by the technique in Section 2. We pick the same L
for the prefix max structure of each $2^w l$-interval for each w. Update time for a
$PR_w$ or $PL_w$ is then reduced to $2L(2^w l)^{1/L}$ array accesses. Summing up from
w = 0 to h - 1, updating requires $8L^2 (2^h l)^{1/L}$ array accesses. However, 4nh
storage cells are required and each query takes 2L array look-ups.

Now we turn to the parameters l and h. If l is too large, some queries may lie
within an l-interval. If h is too small, some queries may span across many
$2^{h-1} l$-intervals. Both types of queries cannot be answered efficiently by BPM(A, l, h).
To eliminate these queries, we can put l = 2 and h = log n - 1. Given a query
MAX(i, j), let w be the leftmost bit position at which the binary representations
of i and j differ. (So $0 \le w \le \log n - 1$ if $i \ne j$.) Then $\lfloor i/2^w \rfloor + 1 = \lfloor j/2^w \rfloor$ and
MAX(i, j) = $\max\{PR_w[i], PL_w[j]\}$. If i = j, then MAX(i, j) is simply asking for
A[i] which is $PL_0[i]$ if i is even and $PR_0[i]$ if i is odd. For the index calculation,
w can be determined by taking the exclusive-or of the binary representations of
i and j, and then computing the position of the leftmost ‘1’ by a look-up table.

In summary, a BPM(A, 2, log n - 1) structure answers range max queries
with 2 array lookups, uses 2n(log n - 1) storage cells and can be initialized in
$O(n \log n)$ time. With the tradeoff technique in Section 2, it requires 2L array
look-ups for a query, $8L^2 n^{1/L}$ array accesses for an update, 4n(log n - 1) storage
and $O(n \log n)$ time for initialization.
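
A small Python sketch of BPM(A, 2, log n - 1) may make the two-look-up query concrete. It is only an illustration: the function names are hypothetical, n is assumed to be a power of two with n >= 4, and the exclusive-or is evaluated with `int.bit_length` instead of a look-up table. One indexing convention here is an assumption of the sketch: bit positions are counted from 0 at the least significant bit, and array number w holds blocks of size $2^{w+1}$, so a query whose leftmost differing bit is p >= 1 reads arrays number p - 1.

```python
def build_bpm(a):
    """Return (PL, PR); PL[w] and PR[w] use blocks of size 2**(w + 1)."""
    n = len(a)                          # assumed: power of two, n >= 4
    h = n.bit_length() - 2              # log2(n) - 1 levels
    pl, pr = [], []
    for w in range(h):
        size = 2 << w
        left, right = list(a), list(a)
        for i in range(1, n):           # max from the block start up to i
            if i % size != 0:
                left[i] = max(left[i - 1], a[i])
        for i in range(n - 2, -1, -1):  # max from i up to the block end
            if i % size != size - 1:
                right[i] = max(right[i + 1], a[i])
        pl.append(left)
        pr.append(right)
    return pl, pr


def range_max(pl, pr, i, j):
    """MAX(i, j) for i <= j with at most two array look-ups."""
    if i == j:                          # A[i] itself is stored in level 0
        return pl[0][i] if i % 2 == 0 else pr[0][i]
    p = (i ^ j).bit_length() - 1        # leftmost differing bit, 0-based
    if p == 0:                          # i even, j = i + 1: same block of size 2
        return pr[0][i]
    # i and j lie in adjacent blocks of size 2**p
    return max(pr[p - 1][i], pl[p - 1][j])
```

The structure holds 2n(log n - 1) cells, matching the storage count in the summary above.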
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3.2 A Tradeoff between Query Time with Storage 

We can reduce the storage by having an ensemble of BPM structures with sui- 
tably chosen parameters. For the time being, we ignore the handling of updates. 
The idea, mainly due to [12], is to recursively construct a data structure for an 
array of size n which has query time t and storage kn. We denote by $R(t, k)$ the
maximum n for which such a data structure is realizable by our construction.

We first consider the base case which consists of several subcases. When
only 1 array look-up is allowed, we use the workaholic approach mentioned in
Section 1. Since we need to store n(n + 1)/2 answers, we have $R(1, k) = 2k - 1$.
When at most 2 look-ups are allowed, we use BPM(A, 2, log n - 1). Therefore,
$R(2, k) = \lfloor 2^{k/2+1} \rfloor$. At the other extreme where k = 1 (i.e., no extra storage
other than A itself), we use the lazy approach and therefore $R(t, 1) = t$. For
simplicity, we choose $R(t, 2) = R(t, 3) = t$ when $t \ge 3$.

For the recursive case where $t \ge 3$ and $k \ge 4$, we apply the recursion technique
taken from [12]. Let $a = R(t, k - 3)$. We classify the queries into two types,
those lying within an a-interval (type 0) and those spanning across at least
two a-intervals (type 1). We handle type 0 queries by recursively constructing
$R(t, k)/a$ data structures, one for each a-interval, with query time t and storage
$(k - 3)a$ each. In total, they consume $(k - 3)a \times R(t, k)/a = (k - 3)R(t, k)$
storage. For type 1 queries, we separate the query range into (at most) three
parts, the left and right parts containing incomplete a-intervals and the middle
part containing zero or more complete a-intervals. We compute the maximums
of the left and right parts in 2 steps by BPM(A, a, 1) which has $2R(t, k)$ storage.
We compute the maximum of the middle part by a data structure for another
array A' containing the maximum of each a-interval of A. Note that A' has size
$R(t, k)/a$, and we are allowed t - 2 look-ups and $a \cdot R(t, k)/a$ storage for this
data structure. Therefore, we choose $R(t, k)/a = R(t - 2, a)$ and hence

$$R(t, k) = R(t, k - 3) \cdot R(t - 2, R(t, k - 3)).$$

Our formula is slightly different from that in [12], which is
$R(t, k) = R(t, k - 6) \cdot R(t - 2, 2R(t, k - 6))$ with an appropriate change of variables. Compared
with theirs, we obtain better query time for the same asymptotic growth rate
of storage. For instance, when t = 4, we obtain a data structure with $O(n \log^* n)$
storage while the construction in [12] requires t = 5 in order to have $O(n \log n)$
storage, and t = 7 to have $O(n \log^* n)$ storage. Below, we give two tables comparing
our results.

(b) Value of $R(t, k)$ in [12]

Fig. 2. Comparison between our result and that of [12]








3.3 Details for the t = 4 Case

We now describe the details for index calculation, trading query and update
costs, and the handling when n < R(t, k) for the chosen t and k. Assuming a
blow-up factor of $O(\log^* n)$ in storage is tolerable in practice, we will concentrate
on the case where t = 4.

From previous tables and formulas, the blow-up factor, k, of storage increases
in steps of 3 in our construction (and 6 in [12]). However, observe that
BPM(A, 2, 1) can handle queries spanning across at most four 2-intervals in
4 lookups. Similarly, BPM(A, 4, 1) and A together can handle queries spanning
across at most four 4-intervals in 4 lookups. Thus we set $R(4, 2) = 8$ and
$R(4, 3) = 16$. Applying the previous recursive construction, we obtain a smoother
increase in the storage blow-up factor.

Unrolling the recursions, our structure is composed of groups of BPM structures.
Suppose the size of array A is n such that $R(t, k - 1) < n \le R(t, k)$ for
some k. Assume k is a multiple of 3. Then the number of groups is determined
as follows. Define a function $g(r) = R(4, k)$ where r = k/3. Then

$$g(1) = 16$$

$$g(r) = 2^{g(r-1)/2+1} \times g(r - 1) \quad \text{for } r > 1.$$

(If k = 2 mod 3, we set r = (k + 1)/3 and g(1) = 8. If k = 1 mod 3, we set
r = (k + 2)/3 and g(1) = 4. We omit these similar cases here.) Note that g(r)
is a power of 2 for all r. With this change of variables, g(r) is the maximum n
for which our recursive construction takes only r recursion levels. Define $\gamma(n)$ as
the smallest integer, r, such that $g(r) \ge n$. It can be shown that $\gamma(n) \le \log^* n$
where $\log^* n = \min\{p \mid \log^{(p)} n \le 2\}$ and $\log^{(p)} n$ is defined as: $\log^{(0)} n = n$,
$\log^{(p)} n = \log(\log^{(p-1)} n)$ for p > 0. Our structure will have $\gamma(n)$ groups of
BPM structures. For $1 \le w < \gamma(n)$, we define the arrays $A_w[0..n/g(w) - 1]$ as

$$A_w[i] = \max\{A[j] \mid i \cdot g(w) \le j < (i + 1) \cdot g(w)\}.$$
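
To get a feeling for how fast g grows, the Python sketch below tabulates $\log_2 g(r)$ rather than g(r) itself. It assumes the recurrence $g(r) = 2^{g(r-1)/2+1} \cdot g(r-1)$ with g(1) = 16, which is what $R(4, k) = R(4, k-3) \cdot R(2, R(4, k-3))$ yields together with $R(2, k) = \lfloor 2^{k/2+1} \rfloor$. The function names are made up for this illustration.

```python
def log2_g(r):
    """Return log2 of g(r); practical only for r <= 4 because of the growth."""
    e = 4                               # g(1) = 16 = 2**4
    for _ in range(r - 1):
        # log2 g(r) = g(r-1)/2 + 1 + log2 g(r-1), where g(r-1)/2 = 2**(e - 1)
        e = 2 ** (e - 1) + 1 + e
    return e


def gamma(n):
    """Smallest r with g(r) >= n, for an integer n >= 1."""
    need = (n - 1).bit_length()         # smallest x with 2**x >= n
    r, e = 1, 4
    while e < need:
        e = 2 ** (e - 1) + 1 + e
        r += 1
    return r
```

Under these assumptions `log2_g(2)` is 13 and `log2_g(3)` is 4110, so `gamma(n)` is at most 3 for every array with at most $2^{4110}$ entries; the same numbers reappear as the thresholds on p in the query procedure of this subsection.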

Group 0, consisting of A and BPM(A, 4, 1), handles queries within a g(1)-interval.
Group 1 consists of $BPM(A_1, 2, g(1)/2)$ and BPM(A, g(1), 1) for queries
within a g(2)-interval (which consists of $2^{g(1)/2+1}$ g(1)-intervals). In general,
for $1 \le w \le \gamma(n) - 2$, group w consists of $BPM(A_w, 2, g(w)/2)$ and
BPM(A, g(w), 1) which can handle queries within a g(w + 1)-interval. For
$w = \gamma(n) - 1$, $2^{g(w)/2}$ can be much larger than n and padding A with $2^{g(w)/2} - n$
dummy elements would be a mistake. Thus we choose $BPM(A_w, 2, \log\lceil n/g(w) \rceil - 1)$
(instead of $BPM(A_w, 2, g(w)/2)$) and BPM(A, g(w), 1) for group $w = \gamma(n) - 1$.
(Note: $g(\gamma(n) - 1) < n$ by definition.)

Initialization and Storage: Group 0 BPM structures take 3n storage cells.
For $0 < w < \gamma(n)$, group w BPM structures take $2 \cdot \frac{n}{g(w)} \cdot \frac{g(w)}{2} + 2n = 3n$
cells. Hence the total space required is $3\gamma(n)n$. Initialization is also easy, using
$O(n\gamma(n))$ time.

Query: To answer a query MAX(i, j), we find the smallest integer w such that
$\lfloor i/g(w + 1) \rfloor = \lfloor j/g(w + 1) \rfloor$. Then i, j fall into the same g(w + 1)-interval
but in different g(w)-intervals. Therefore, MAX(i, j) = max{MAX(i, i' - 1),
MAX(i', j' - 1), MAX(j', j)} where $i' = \lceil i/g(w) \rceil g(w)$ and $j' = \lfloor j/g(w) \rfloor g(w)$.
For the first and last sub-ranges, we look up BPM(A, g(w), 1). That is,

$$MAX(i, i' - 1) = PR_0[i]$$

$$MAX(j', j) = PL_0[j]$$

where $PR_0$ and $PL_0$ are arrays in BPM(A, g(w), 1). For the middle sub-range,
we search for the range [i'', j''] in $BPM(A_w, 2, g(w)/2)$ where $i'' = \lfloor i'/g(w) \rfloor$
and $j'' = \lfloor (j' - 1)/g(w) \rfloor$. That is,

$$MAX(i', j' - 1) = \max\{PR_{p - \log g(w)}[i''], PL_{p - \log g(w)}[j'']\}$$

where p is the leftmost bit at which the binary representations of i and j differ.
Therefore, it takes at most 4 array look-ups. For the index calculation, we
determine p as described in Subsection 3.1. For w, we check the range of p. If
p < 4, then w = 0. If $4 \le p < 13$, then w = 1. If $13 \le p < 4110$, then w = 2, etc.
This can be done in constant time by using a look-up table of size $O(\log n)$.

Update: To process an update, each BPM structure needs to be changed. Updating
BPM(A, g(w), 1) takes $O(g(w))$ time for each $1 \le w \le \gamma(n) - 1$. Updating
$BPM(A_w, 2, g(w)/2)$ takes $O(2^{g(w)/2})$ time for each $1 \le w \le \gamma(n) - 2$. For
$w = \gamma(n) - 1$, updating $BPM(A_w, 2, \log(n/g(w)) - 1)$ takes $O(n/g(w))$ time.
Observe that $2^{g(w-1)/2} < g(w)$ and that $g(\gamma(n) - 1) < n$. Therefore, the total
worst case update time is

$$O\Big(\sum_{w=1}^{\gamma(n)-1} g(w)\Big) + O\Big(\sum_{w=1}^{\gamma(n)-2} 2^{g(w)/2}\Big) + O\Big(\frac{n}{g(\gamma(n) - 1)}\Big) = O(n).$$

Using the technique in Section 2, updating BPM(A, g(w), 1) takes $4Ln^{1/L}$
accesses for each $1 \le w \le \gamma(n) - 1$. Updating $BPM(A_w, 2, g(w)/2)$ for
$1 \le w \le \gamma(n) - 2$ and $BPM(A_w, 2, \log\lceil n/g(w) \rceil - 1)$ takes $8L^2 n^{1/L}$ accesses each.
Hence it takes $12L^2 n^{1/L}\gamma(n)$ accesses in total. Query now takes 4L look-ups
and storage becomes $6n\gamma(n)$.

4 Extension to Higher Dimensions 

In this section, we define a class of data structures called oblivious storage scheme 
and describe a technique to extend such data structures for 1-dimensional range 
queries to higher dimensions. 
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4.1 Oblivious Storage Scheme 

Informally, an oblivious storage scheme is a data structure in which the set of 
storage cells to be examined or changed is determined by the query region or 
update position rather than on the values in the array. Similar concepts were 
introduced in [17,29,11]. Here we describe a definition suitable for our purpose. 
Let A be an array over a commutative semigroup G. An oblivious storage scheme 
for A is a triple, (B, Q, U), where

1. B is an array of storage cells containing elements in G,

2. Q is a set of programs, one for each query region, and

3. U is a set of programs, one for each update position.

The storage cost of the scheme is the size, m, of B. For each query region, R,
the corresponding program in Q is a sequence of integers, $(\mu_0, \ldots, \mu_{m-1})$, such
that

$$\sum_{r \in R} A[r] = \mu_0 B[0] + \cdots + \mu_{m-1} B[m - 1]$$

where ‘+’ is the addition operation in G. When answering a query, extra temporary
storage may be needed to evaluate the expression. However, we do not
charge it towards the storage since it is temporary. When evaluating the expression,
those terms with $\mu$'s equal to 0 need not be added. Thus, the number
of non-zero $\mu$'s is taken as the query cost. In general, we require all the $\mu$'s to
be non-negative. However, if G is a group, then we also allow negative $\mu$'s.

For each update position, r, the corresponding program in U consists of a
sequence of instructions in one of the following forms: (i) B[j] = new value for
A[r], (ii) $B[j] = \lambda_0 B[0] + \cdots + \lambda_{m-1} B[m - 1]$ where the $\lambda$'s are integers if G is
a group, and non-negative integers if G is a semi-group. The total number of
non-zero $\lambda$'s in all the instructions is taken as the update cost.

Our prefix and range max structures, the prefix sum structure of [22] and 
the relative prefix sum structure of [18] are all oblivious storage schemes. On 
the other hand, the combination of Cartesian tree [26] and the nearest common 
ancestor algorithm [20] for 1-dimensional range max is not. 



4.2 Combining Oblivious Storage Schemes 

Let S and $S_{d-1}$ be oblivious storage schemes for a 1-dimensional array of size n
and a (d - 1)-dimensional array of size $n^{d-1}$ respectively. We can combine them
into an oblivious storage scheme, $S_d$, for a d-dimensional array A of size $n^d$ as
follows. Suppose S requires m storage cells and $S_{d-1}$ requires $m_{d-1}$ storage cells.
Then we will make use of two arrays of storage cells, $C[0..m - 1, 0..n^{d-1} - 1]$ and
$B[0..m - 1, 0..m_{d-1} - 1]$. For each position r in dimension 2 to d, we follow S and
construct a storage scheme for the subarray A[0..n - 1, r] using C[0..m - 1, r] as
the storage cells. Next, for each position i in dimension 1, we follow $S_{d-1}$ and
construct a storage scheme for the subarray $C[i, 0..n^{d-1} - 1]$ using $B[i, 0..m_{d-1} - 1]$
as the storage cells. The total storage of the new scheme is $m(n^{d-1} + m_{d-1})$.







Note that if the storage cells of Sd-i and S includes a copy of the original array, 
then the array C need not be stored in Sd- Then the storage becomes mrud-i- 
Let the query program for region R 2 x ■ ■ ■ x Rd in Sd-i be (/xq, . . . , 
and that for Ri in S be ( 779 , • . • , rym-i). Then the query program for Ri x ■ ■ ■ x Rd 
in S can be derived as follows: 

= ^iVoC[0, r] H h - 1, r]) 

i,r r 

= %(Mo-B[ 0, 0] H h rud-i - 1]) 

H h 77m-i(/io^[wd-i,0] H h ^imi.l-lB[md-l - l,m - 1]) 

If there are p and q non-zero iq's and /x’s respectively, there will be pq non-zero 
{ripYs. Hence the query cost of S is the product of that of Sd-i and S. 

Similarly, the update program for $(i,r)$ in $S_d$ can be derived as follows. Let
$U$ and $U_{d-1}$ be the update programs for $i$ in $S$ and for $r$ in $S_{d-1}$ respectively.
Furthermore, let the set of indices of storage cells that appear on the left-hand
side of $U$ be $\{j_1, \ldots, j_l\}$. These are the storage cells of $S$ that are updated by $U$.
Then the update program for $(i,r)$ executes the program $U$ on $C[0..m-1, r]$,
followed by the program $U_{d-1}$ on $B[j_1, 0..m_{d-1}-1], \ldots, B[j_l, 0..m_{d-1}-1]$.
(That is, $U_{d-1}$ is executed $l$ times.) Thus, if the two programs have cost $t$ and
$t_{d-1}$ respectively, then the new program has cost $t + t_{d-1}t = t(1 + t_{d-1})$.
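
The composition can be sketched for d = 2 as follows. This is only an illustration under stated assumptions: the 1-dimensional scheme is taken to be a plain prefix-sum array (chosen for brevity, not one of the structures of this paper), and the names `PrefixScheme` and `Composed2D` are hypothetical. The intermediate array C is not materialised; each update is pushed directly into B.

```python
class PrefixScheme:
    """Illustrative 1-D oblivious scheme: cell k stores A[0] + ... + A[k]."""

    def __init__(self, n):
        self.n = n
        self.m = n  # number of storage cells used by the scheme

    def query_program(self, lo, hi):
        """Return {cell: coefficient}; the weighted sum equals sum(A[lo..hi])."""
        prog = {hi: 1}
        if lo > 0:
            prog[lo - 1] = -1
        return prog

    def update_cells(self, i):
        """Cells that receive the delta when A[i] changes."""
        return range(i, self.n)


class Composed2D:
    """Combine a scheme for dimension 1 with a scheme for dimension 2."""

    def __init__(self, s1, s2):
        self.s1, self.s2 = s1, s2
        # Row j of B stores row j of the intermediate array C using scheme s2.
        self.B = [[0] * s2.m for _ in range(s1.m)]

    def update(self, i, r, delta):
        # Run U for position i, then U_{d-1} on every row j that U touches.
        for j in self.s1.update_cells(i):
            for k in self.s2.update_cells(r):
                self.B[j][k] += delta

    def query(self, lo1, hi1, lo2, hi2):
        # Multiply out the two query programs: p * q cells are read.
        eta = self.s1.query_program(lo1, hi1)
        mu = self.s2.query_program(lo2, hi2)
        return sum(e * u * self.B[j][k]
                   for j, e in eta.items() for k, u in mu.items())


grid = Composed2D(PrefixScheme(4), PrefixScheme(4))
grid.update(1, 2, 5)
grid.update(3, 0, 2)
total = grid.query(0, 3, 0, 3)  # range covering the whole 4 x 4 array
```

With this choice each 1-dimensional query program has at most two non-zero coefficients, so a 2-dimensional query reads at most four cells of B, which is the product rule stated above.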

Applying this composition technique repeatedly to a 1-dimensional structure
with query time $t_q$, update time $t_u$ and storage $m$ which does not contain the
base array $A$, we obtain a $d$-dimensional structure with query time $t_q^d$, update time
at most $(2t_u)^d$ and storage $dm^d$. If the structure contains $A$, the storage is only $m^d$.
Applying to our 1-dimensional range max structure, we obtain a $d$-dimensional
range max structure with $O((4L)^d)$ query time, $O((12L^2 n^{1/L}\gamma(n))^d)$ update time
and $O((6n\gamma(n))^d)$ storage. Applying to our 1-dimensional range sum structure,
we obtain a $d$-dimensional range sum structure which has $O((2L)^d)$ query time,
$O((2Ln^{1/L})^d)$ update time and $O((2n)^d)$ storage.

5 Conclusion 

We have designed efficient data structures for range sum and max queries. The 
range sum structure generalizes that of [22,18,17]. Our range max structure has 
constant query time and almost linear space. It handles updates more efficiently
than [12] which is basically a static structure. It would be nice to have a truly 
linear space and constant time range max structure, or a proof that linear space 
is impossible. 

Abstract. In OLAP applications, data are modeled as points in a
multidimensional space. Dimensions themselves have structure, described by
a schema and an instance; the schema is basically a directed acyclic
graph of granularity levels, and the instance consists of a set of elements
for each level and mappings between these elements, usually called
rollup functions. Current dimension models restrict dimensions in various
ways; for example, rollup functions are restricted to be total. We relax
these restrictions, yielding what we call heterogeneous schemas, which
describe more naturally and cleanly many practical situations. In the
context of heterogeneous schemas, the notion of summarizability becomes
more complex. An aggregate view defined at some granularity level
is summarizable from a set of precomputed views defined at other levels
if the rollup functions can be used to compute the first view from the set
of views. In order to study summarizability in heterogeneous schemas,
we introduce a class of constraints on dimension instances that enrich
the semantics of dimension hierarchies, and we show how to use the
constraints to characterize and test for summarizability.



1 Introduction 

The multidimensional model is becoming increasingly important as a logical 
layer for visualizing and querying data in OLAP scenarios. A key aspect of 
multidimensional data is the separation of factual and dimensional data. While 
dimensions represent descriptive and relatively static data, facts depict event- 
based data, represented as points in spaces defined by dimensions. 

A number of multidimensional models for OLAP [CT97][HMV99a][LAW98] 
[JLS99] have recently incorporated dimensions as first-class entities in query and 
update languages. In the logical layer, a dimension is composed of a schema and 
an instance. The dimension schema includes a directed acyclic graph (DAG) of 
levels, called hierarchy schema, where levels may have attributes associated with 
them. Levels can be viewed as foreign keys of tables containing values for the 
attributes. 

On the other hand, a dimension instance consists of a set of members for 
each level, called member sets, and a hierarchy relation that models the
ancestor/descendant relation between the members. For instance, we may have
Toronto as a member of a level City in a dimension representing locations. In
most of the models [CT97] [HMV99a] [HMV99b], the hierarchy relation is
represented by a set of functions between member sets, called rollup functions.
Usually, we say that a level $l_1$ rolls up to another level $l_2$ when there exists an
edge from $l_1$ to $l_2$ in the hierarchy, meaning that there is a rollup function from
$l_1$ to $l_2$.



1.1 Heterogeneous Dimensions 

Suppose we have a dimension representing stores in Canada and the USA. While 
the stores in Canada roll up to cities and to provinces, the stores in the USA 
roll up to cities and to states. Figures 1(B) and 1(C) show two possible
dimension hierarchies for the location dimension (we abstract away the attributes);
a possible instance is depicted in Figure 1(D). Notice that in the schemas (B)
and (C) the rollup functions are total. An alternative schema for representing
this dimension is depicted in Figure 1(A). However, here the rollup
functions between City and State, and between City and Province, are partial,
because some cities do not have states while others do not have provinces.

At this point, we need to introduce some terminology that will be central 
in this paper, and which we formalize in the next section. A dimension schema 
is homogeneous if for every pair of levels $l_1$ and $l_2$ such that $l_1$ rolls up to $l_2$
we have that in every dimension instance conveyed in the schema, the rollup
function is a total function from the member set of $l_1$ to the member set of $l_2$. A
dimension schema is strictly homogeneous if it is homogeneous and it has only 
one level that is at the bottom of the hierarchy schema; a dimension schema is 
heterogeneous if it is not homogeneous. 

Coming back to our example, the schema (A) is heterogeneous because in
the instance of Figure 1(D), the rollup function between City and State is a
partial function from {NewYork, Toronto} to {NYstate}. On the other hand,
the schema (B) is homogeneous, and the schema (C) is strictly homogeneous.
The choice between schemas (A), (B), or (C) depends on factors like the at- 
tributes that the levels share, and how we would like to group members into 
levels in the dimension. The schemas that allow heterogeneity could be better 
in many situations. As an example, if we restrict the schemas (B) and (C) to 
be homogeneous, the insertion of the stores in Washington D.C. would require
a new sub-hierarchy StoreWashington-CityWashington below the level Country,
where CityWashington contains, in any instance, only one member (because
Washington is the only city in the USA that does not have a state). Modeling
heterogeneity allows the fusion of levels that represent the same granularity of 
aggregation, reducing the complexity of the schema and, therefore, of query for- 
mulation. The benefit of having a more flexible model at the logical layer can 
also be propagated to the storage layer. For instance, heterogeneity allows the 
fusion of levels that share attributes, while keeping separate levels with different 
attributes. 

We end this section by noting that current dimension models [CT97] 
[HMV99a] [HMV99b] [JLS99] do not allow heterogeneity. 




Heterogeneous Multidimensional Schemas 377 




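[Figure 1, panels (A)-(D); image not reproduced. Level names legible in the
schema panels: All, Country, Province, PoliticalDiv., City, CityUSA, CityCanada,
Store, StoreUSA, StoreCanada. Members legible in the instance panel (D): USA,
Canada, NYstate, Ontario, New York, Toronto.]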



Fig. 1. (A) (B) (C) Three alternative dimension hierarchies for the location dimension. 
(D) An instance of the location dimension. 



1.2 Summarizability 

Data cubes [GBLHP96] comprise the computation of a set of aggregate views,
called cube views, which represent facts aggregated at different granularities (sets
of levels taken from a set of dimensions). Different fundamental functionalities
for OLAP, like aggregate navigation, cube computation, and cube maintenance
[CD97], among others, require the derivation, using pre-defined aggregate
queries, of cube views from other pre-computed cube views, transparently to the
user. In order to do this, the system must determine which derivations are
correct; in other words, whether a cube view can be derived from other cube views
through predefined aggregate queries.

The notion of summarizability [RS90] refers to the conditions under which
we can correctly derive any cube view defined at a level $l_2$ from a cube view
defined at a level $l_1$ by aggregating using the rollup functions between $l_1$ and $l_2$
of a particular dimension. In order to allow summarizability the data cube must
be distributive, i.e., its aggregate functions must be distributive [GBLHP96]. A
distributive aggregate function $af$ can be computed on a set by partitioning the
set into disjoint subsets, aggregating each separately, and then computing the
aggregation of these partial results with another aggregate function^1 that we
denote by $af^c$.
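
As a small illustration of distributivity, the sketch below aggregates a list in disjoint chunks and then combines the partial results. The helper name `aggregate_in_parts` and the sample values are made up for this example; the pairing of COUNT with SUM as the combining function follows the footnote in the text.

```python
def aggregate_in_parts(values, parts, af, af_c):
    """Split `values` into `parts` disjoint chunks, apply af to each chunk,
    then combine the partial results with af_c."""
    chunks = [values[i::parts] for i in range(parts)]
    partials = [af(chunk) for chunk in chunks if chunk]
    return af_c(partials)


sales = [12, 7, 30, 5, 18, 7]  # hypothetical measure values

count_total = aggregate_in_parts(sales, 3, len, sum)  # COUNT combined with SUM
sum_total = aggregate_in_parts(sales, 3, sum, sum)    # SUM combined with SUM
min_total = aggregate_in_parts(sales, 3, min, min)    # MIN combined with MIN
max_total = aggregate_in_parts(sales, 3, max, max)    # MAX combined with MAX
```

Each result equals the aggregate computed directly on the whole list, regardless of how the list is partitioned.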

The central problem we address in this paper is: how can we infer
summarizability in distributive data cubes from the dimension schema, without having
to analyze the dimension instance? In particular, it has been shown [HRU96]
that, in strictly homogeneous dimensions, correct summarizations correspond to
the edges of the dimension hierarchy, as depicted in the following example. We
give the formalization using the relational algebra with bag semantics, extended
with the generalized projection operator to express aggregation [AGS+96]. The
generalized projection operator, $\Pi_A$, is an extension of the duplicate-eliminating
projection, where $A$ can include both regular and aggregate attributes. For
simplicity, all the aggregates of cube views are assumed to be on a single attribute.

^1 Among the SQL aggregate functions, COUNT, SUM, MIN, and MAX are
distributive. We have that $COUNT^c = SUM$; for SUM, MIN, and MAX we have $af^c = af$.

Example 1. Consider the dimension with schema depicted in Figure 1(B). From
the schema we can infer the correctness of the summary operation that
computes the total sales per province from the total sales per city in Canada:
$\Pi_{Province,\,SUM(Sales)}(Sales\_CityCan \bowtie \Gamma_{CityCan}^{Province})$,
where $\Gamma_{CityCan}^{Province}$ represents
the rollup function from the cities of Canada to the provinces.
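
The summary operation of Example 1 can be mimicked in a few lines. This is a sketch only: the helper `rollup_sum`, the city names other than Toronto, and all sales figures are invented for illustration. A dictionary plays the role of the rollup function, and grouping plus summation plays the role of the generalized projection over the join.

```python
def rollup_sum(view, rollup):
    """Join a {member: total} view with a rollup mapping and sum per parent."""
    totals = {}
    for member, amount in view.items():
        if member in rollup:  # the join drops members without a parent
            parent = rollup[member]
            totals[parent] = totals.get(parent, 0) + amount
    return totals


# Hypothetical per-city totals and a total rollup function City -> Province.
sales_city_can = {"Toronto": 120, "Ottawa": 80, "Vancouver": 50}
city_to_province = {
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "Vancouver": "British Columbia",
}

sales_province = rollup_sum(sales_city_can, city_to_province)
```

Because the rollup function here is total, every city total contributes to exactly one province, so no sales are lost or counted twice.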

In this paper we extend the notion of summarizability, previously introduced 
for dimensions over strictly homogeneous schemas, in order to consider deriva- 
tions from sets of levels to levels in dimensions over dimension schemas that are 
not strictly hierarchical. The following example shows the extra complexity of 
inferring summarizability in heterogeneous dimension schemas. 

Example 2. Consider the hierarchy schema of the location dimension of Figure
1(A). Consider the cube views Sale_Province and Sale_State representing the
total sales per province and state, respectively, and defined as follows:

$Sale\_Province = \Pi_{Province,\,SUM(Sales)}(Sales\_Store \bowtie \Gamma_{Store}^{Province})$
$Sale\_State = \Pi_{State,\,SUM(Sales)}(Sales\_Store \bowtie \Gamma_{Store}^{State})$

Consider the following aggregation that derives the total sales per Country
from Sale_Province and Sale_State:

$\Pi_{Country,\,SUM(Sales)}\big(\Pi_{Country,\,SUM(Sales)}(Sale\_Province \bowtie \Gamma_{Province}^{Country}) \;\uplus\; \Pi_{Country,\,SUM(Sales)}(Sale\_State \bowtie \Gamma_{State}^{Country})\big)$

where $\Gamma_{l_1}^{l_2}$ represents the rollup function from $l_1$ to $l_2$, and $\uplus$ represents the
additive union, which adds the multiplicity of the tuples. Intuitively, we could say
that this derivation is correct, because there are no stores that roll up to a state
and to a province at the same time. However, we cannot infer the correctness of
this derivation from the schema (A), because it does not capture precisely the
possible set of instances we are modeling; in particular, it does not explicitly
disallow a store that rolls up to both a state and a province.
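
The following sketch replays the derivation of Example 2 on a tiny made-up instance and then on an anomalous one in which a store rolls up to both a state and a province. All store names, amounts and helper names (`group_sum`, `sum_merge`, `country_totals`) are hypothetical.

```python
def group_sum(facts, rollup):
    """Join {member: amount} with a (possibly partial) rollup and sum per parent."""
    out = {}
    for key, amount in facts.items():
        if key in rollup:
            parent = rollup[key]
            out[parent] = out.get(parent, 0) + amount
    return out


def sum_merge(a, b):
    """Additive union of two views followed by SUM on the shared key."""
    out = dict(a)
    for key, amount in b.items():
        out[key] = out.get(key, 0) + amount
    return out


def country_totals(store_sales, store_province, store_state,
                   province_country, state_country):
    sale_province = group_sum(store_sales, store_province)
    sale_state = group_sum(store_sales, store_state)
    return sum_merge(group_sum(sale_province, province_country),
                     group_sum(sale_state, state_country))


store_sales = {"s1": 10, "s2": 20, "s3": 5}
province_country = {"Ontario": "Canada"}
state_country = {"NYstate": "USA"}

# Well-behaved instance: each store reaches a province or a state, never both.
good = country_totals(store_sales, {"s1": "Ontario"},
                      {"s2": "NYstate", "s3": "NYstate"},
                      province_country, state_country)

# Anomalous instance: store s2 rolls up to a province and to a state.
bad = country_totals(store_sales, {"s1": "Ontario", "s2": "Ontario"},
                     {"s2": "NYstate", "s3": "NYstate"},
                     province_country, state_country)
```

In the anomalous instance the sales of s2 are counted once through Ontario and once through NYstate, which is exactly the situation the schema (A) alone cannot rule out.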

The question that arises at this point is: what additional constraints do we
have to add in order to keep the ability to reason about summarizability in
heterogeneous schemas?

1.3 Contributions and Outline 

In this paper we introduce a dimensional model that accounts for heterogeneity. 
We define the notion of summarizability for heterogeneous dimensions, which 
smoothly extends the notion of summarizability for homogeneous dimensions. 
We identify four classes of dimension schemas, which go from strictly
homogeneous to a class of schemas we call hierarchical schemas, which allow
heterogeneity but keep a notion of ordering between the granularities defined by levels.
We introduce a class of constraints, split constraints, that enrich the semantics
of dimension hierarchies. Finally, we solve the problem of deciding
summarizability in hierarchical dimensions constrained with split constraints. The solution
is obtained from a sound and complete axiomatization of a subclass of split
constraints.

The remainder of the paper is organized as follows. Section 2 introduces the
model for dimensions, along with the formalization of the notion of
summarizability. Section 3 introduces the new constraints we propose. The inference problem
for summarizability is studied in Section 4. In Sections 5 and 6 we discuss
related work, conclude, and outline some of the prospects for future work.



2 Heterogeneous Dimensions 

In this section, we describe our framework for modeling dimensions. A dimension 
schema will consist of a hierarchy schema and a set of constraints. 



2.1 Hierarchy Schema 

We define a hierarchy schema in the same fashion as in [JLS99] and [HMV99a], 
but allowing heterogeneity, multiple hierarchical paths, and multiple bottom 
levels. Multiple hierarchical paths are frequently required as argued in [HMV99a], 
[CT97] . Allowing multiple bottom levels makes it possible to have more natural 
schemas in several situations as shown in [JLS99]. 

Consider a set of members $\mathbf{E}$, a set of levels $\mathbf{L}$, and a set of attributes $\mathbf{A}$.

Definition 1 (Hierarchy Schema). A hierarchy schema is a tuple
$G = (L, \nearrow, A, \alpha)$, where $L \subseteq \mathbf{L}$ is a set of levels with a distinguished level All;
$\nearrow$ is a binary relation on $L$ such that $(L, \nearrow)$ forms a rooted DAG with All as root
(we denote by $\nearrow^*$ the transitive closure of $\nearrow$); $A \subseteq \mathbf{A}$ is a set of attributes; and
$\alpha : L \to 2^A$ assigns to each level a set of attributes.

We refer to the set of bottom levels of a hierarchy schema,
$\{l \in L \mid \neg\exists l' \in L : l' \nearrow l\}$, as $L_{Bottom}$. Given two levels $l_a, l_b \in L$, we denote by the set of
paths between $l_a$ and $l_b$ in $G$.

2.2 Dimension Instance and Schema 

An instance of a dimension is obtained by specifying a set of members for each 
level, along with the descendant/ancestor relation < between them. 

Definition 2 (Dimension Instance). A dimension instance is a tuple
$(G, \mathcal{E}, <, T)$, where $G$ is a hierarchy schema; $\mathcal{E}$ is a set of sets of members in $\mathbf{E}$, one set
$\mathcal{E}_l$ for each level $l \in L$, where we denote by $\mathcal{E}_d$ the union of the member sets of
a dimension $d$; $<$ is a relation between members forming a rooted DAG with
root all, where we denote by $\ll$ the transitive closure of $<$; and finally, $T$ is a
set of relations that contains one relation $T_l$, with attributes $\alpha(l) \cup \{l\}$, for every
level $l \in L$. The following conditions hold: (1) the member set of a level $l$, $\mathcal{E}_l$, is
the active domain of $l$ in $T_l$, and $l$ is a key of $T_l$; (2) for every pair of members
$e_a, e_b$ such that $e_a \in \mathcal{E}_{l_a}$, $e_b \in \mathcal{E}_{l_b}$ and $e_a < e_b$, we have $l_a \nearrow l_b$, and there are no
members $e_1, \ldots, e_n \in \mathcal{E}_d$ such that $e_a < e_1 < \cdots < e_n < e_b$; (3) for every member
$e \in \mathcal{E}_d$ such that $e \neq all$ we have $e \ll all$.

The first condition says that the relation for level $l$ contains exactly one tuple
for each element in $\mathcal{E}_l$. The second condition essentially says that the edges in
the dimension hierarchy represent links between levels, that must exist whenever 
we have a direct descendant/ancestor relation between some pair of members in 
the levels. And finally, the last condition states that all the members reach the 
top member all. Note that the member sets are not necessarily disjoint. 

Given a dimension $d$, a leaf member is a member $e \in \mathcal{E}_d$ with no descendant
members. An important feature of the model is that it may have leaf members in
non-bottom levels. This allows updating the dimension, as shown in [HMV99b]
(for instance, we might want to add a city but do not yet have stores that belong
to it). However, in order to simplify the presentation, we make two assumptions:
(a) all the leaf members belong to the bottom levels; and (b) the member sets of
the bottom levels are pairwise disjoint. We define a base level, denoted by $l_{base}$,
containing the union of the member sets of the bottom levels. The results in this
paper can be extended to dimensions where (a) and (b) do not hold by a more
detailed treatment of base members, which basically consists in defining them
as id's of the leaves.

Definition 3 (Rollup Operators). Given a dimension instance $d$, we define
the direct rollup operator, which takes a dimension and two of its levels $l_1, l_2$ and
gives a relation with attributes $l_1$ and $l_2$ defined as follows:
$d\Gamma_{l_1}^{l_2} = \{(x_1, x_2) \mid x_1 \in \mathcal{E}_{l_1} \land x_2 \in \mathcal{E}_{l_2} \land x_1 < x_2\}$.
We have the rollup operator, with the same signature as $d\Gamma$, which gives a relation
with attributes $l_1$ and $l_2$ defined as follows:
$\Gamma_{l_1}^{l_2} = \{(x_1, x_2) \mid x_1 \in \mathcal{E}_{l_1} \land x_2 \in \mathcal{E}_{l_2} \land x_1 \ll x_2\}$.
The base rollup operator takes a level $l$ and defines the relation, with attributes
$l_{base}$ and $l$, that groups the base elements to it, and is defined as follows:
$\Gamma_{l_{base}}^{l} = \{(x, y) \mid x \in \mathcal{E}_{l_{base}} \land y \in \mathcal{E}_l \land x \ll y\}$.

The following are some basic properties of the rollup operators: given a
dimension instance $d$ we have: (a) if $\neg(l_a \nearrow l_b)$ then $d\Gamma_{l_a}^{l_b} = \emptyset$; (b) if $\neg(l_a \nearrow^* l_b)$
then $\Gamma_{l_a}^{l_b} = \emptyset$; (c) if the only path between $l_a$ and $l_b$ is the edge $\langle l_a, l_b \rangle$, then
$\Gamma_{l_a}^{l_b} = d\Gamma_{l_a}^{l_b}$.
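
The three operators can be prototyped directly from their set definitions. The sketch below is an illustration under assumptions that are not part of the paper: member sets are stored in a dictionary keyed by level name, the relation < is a set of (child, parent) pairs, and the instance is a small invented version of the location dimension.

```python
def ancestors(lt, x):
    """All members y with x << y, i.e. reachable through the child/parent pairs."""
    seen, stack = set(), [x]
    while stack:
        cur = stack.pop()
        for child, parent in lt:
            if child == cur and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def direct_rollup(members, lt, l1, l2):
    """Pairs (x1, x2) with x1 in level l1, x2 in level l2 and x1 < x2."""
    return {(x, y) for (x, y) in lt if x in members[l1] and y in members[l2]}


def rollup(members, lt, l1, l2):
    """Pairs (x1, x2) with x1 in level l1, x2 in level l2 and x1 << x2."""
    return {(x, y) for x in members[l1]
            for y in ancestors(lt, x) if y in members[l2]}


def base_rollup(members, lt, bottom_levels, level):
    """Rollup from the union of the bottom-level member sets to `level`."""
    base = set().union(*(members[b] for b in bottom_levels))
    return {(x, y) for x in base
            for y in ancestors(lt, x) if y in members[level]}


# Hypothetical instance of a location dimension.
members = {
    "Store": {"s1", "s2"},
    "City": {"Toronto", "NewYork"},
    "Province": {"Ontario"},
    "State": {"NYstate"},
    "Country": {"Canada", "USA"},
    "All": {"all"},
}
lt = {
    ("s1", "Toronto"), ("s2", "NewYork"),
    ("Toronto", "Ontario"), ("NewYork", "NYstate"),
    ("Ontario", "Canada"), ("NYstate", "USA"),
    ("Canada", "all"), ("USA", "all"),
}

city_country_direct = direct_rollup(members, lt, "City", "Country")
city_country = rollup(members, lt, "City", "Country")
store_province = base_rollup(members, lt, ["Store"], "Province")
```

In this instance there is no direct edge from City to Country, so the direct rollup is empty while the rollup is not, and the base rollup to Province is partial because store s2 reaches no province.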

Definition 4 (Partitioned Instances). A dimension instance is partitioned
when all its rollup relations are single valued (partial functions). The
partitioning property appears as an inherent constraint in the dimension models of
[CT97], [HMV99a], [LAW98], and [JLS99]. It requires that each member in the
base level reach no more than one member in each level $l$. In
this sense, the levels represent partitioned classifications of the base members.
In the sequel we assume that all dimension instances are partitioned.

We are ready to define a dimension schema as a hierarchy schema plus a set of
constraints in some constraint language $\mathcal{CL}$. In Section 3 we introduce a specific
$\mathcal{CL}$, the language of split constraints.

Definition 5 (Dimension Schema). A $\mathcal{CL}$-dimension schema is a tuple
$ds = (G, \Sigma)$ where $G$ is a hierarchy schema and $\Sigma \subseteq \mathcal{CL}$.

The constraint language must have a notion of satisfaction, denoted by $\models_{\mathcal{CL}}$.
A dimension instance $d$ is over a dimension schema $ds$ if $G_d = G_{ds}$ and $d \models_{\mathcal{CL}} \Sigma$.
Given a dimension schema $ds$ we denote by $I(ds)$ the set of dimension instances
that are over $ds$.



2.3 Classes of Dimension Schemas 

The following definition formalizes the classes of dimension schemas we
mentioned in Section 1.

Definition 6 (Homogeneous and Heterogeneous Dimension Schemas).
A dimension schema $ds$ is homogeneous if every dimension $d$ over $ds$ satisfies:
for every pair of levels $l_1, l_2$ such that $l_1 \nearrow l_2$ we have that
$\Gamma_{l_1}^{l_2} : \mathcal{E}_{l_1} \to \mathcal{E}_{l_2}$ is a
total function. A dimension schema is strictly homogeneous if it is homogeneous
and has a single bottom level. A dimension schema is heterogeneous if it is not
homogeneous.

In a homogeneous dimension instance, the ordering of the levels provided 
by the graph is exactly the same as the ordering of the grains represented by 
the levels. In other words, as we move up the hierarchy schema we reach levels 
that represent coarser partitions of the base members. Furthermore, the rollup 
functions capture precisely the containment relation between the partitions of 
levels connected in the graph. These two properties can no longer be true in 
heterogeneous dimension instances. 

In order to formalize this intuition, let us define the notion of overlap, as
introduced for classification schemas in statistical databases [Mal93]. The overlap
between two levels $l_1$ and $l_2$ is a relation $O_{l_1}^{l_2}$ with signature $\mathcal{E}_{l_1} \times \mathcal{E}_{l_2}$ that has
an edge between the members $e_1$ and $e_2$ iff the intersection of the sets of base
members that reach them is non-empty. The overlap between two levels can be
defined with the following relational-algebra expression:
$O_{l_1}^{l_2} = \pi_{l_1, l_2}(\Gamma_{l_{base}}^{l_1} \bowtie \Gamma_{l_{base}}^{l_2})$.
We say that $l_1 \sqsubseteq l_2$ in a dimension instance $d$ if $O_{l_1}^{l_2} = \Gamma_{l_1}^{l_2}$. That is,
$l_1 \sqsubseteq l_2$ means: (a) the grain of $l_1$ is finer than the grain of $l_2$ because the overlap
is a function (possibly partial); and (b) the rollup function captures precisely
the containment relation between the grains of $l_1$ and $l_2$.

Example 3. The dimension instance of Figure 2 satisfies City $\sqsubseteq$ State, i.e., there
is no City c and State s such that they overlap and c does not roll up to s. Another
way to think of the fact that City $\sqsubseteq$ State is as follows: if there is a store that
rolls up to a city c and to a state s, then c rolls up to s.

It is important to note that in general $\sqsubseteq$ is not transitive, i.e., if a dimension
instance $d$ satisfies $l_1 \sqsubseteq l_2$ and $l_2 \sqsubseteq l_3$, it does not necessarily satisfy $l_1 \sqsubseteq l_3$.

Example 4. Consider the dimension instance depicted in Figure 2. Although we
have City $\sqsubseteq$ SaleDistrict and SaleDistrict $\sqsubseteq$ SaleRegion, we do not have
City $\sqsubseteq$ SaleRegion, because two stores which roll up to NewYork, such as $s_1$
and $s_2$, roll up to two different sale regions, $r_1$ and $r_2$.
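
The lack of transitivity can be checked mechanically on a small instance. The sketch below does not reproduce Figure 2; it uses an invented fragment with the same flavour, and it reads the relation between two levels in the sense of Example 3: no pair of members overlaps without also being related by rollup. The function names are hypothetical.

```python
def ancestors(lt, x):
    """All members reachable from x through the child/parent pairs in lt."""
    seen, stack = set(), [x]
    while stack:
        cur = stack.pop()
        for child, parent in lt:
            if child == cur and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def rollup(members, lt, l1, l2):
    return {(x, y) for x in members[l1]
            for y in ancestors(lt, x) if y in members[l2]}


def overlap(members, lt, base, l1, l2):
    """Pairs (e1, e2) that share at least one base member below them."""
    pairs = set()
    for b in base:
        up = ancestors(lt, b)
        pairs |= {(e1, e2) for e1 in up & members[l1] for e2 in up & members[l2]}
    return pairs


def finer(members, lt, base, l1, l2):
    """Example 3 reading: every overlapping pair is also a rollup pair."""
    return overlap(members, lt, base, l1, l2) <= rollup(members, lt, l1, l2)


# Hypothetical fragment: stores s1 and s2 sit in NewYork but report
# directly to two different sale regions.
base = {"s1", "s2", "s3"}
members = {
    "City": {"NewYork", "Buffalo"},
    "SaleDistrict": {"d1"},
    "SaleRegion": {"r1", "r2"},
}
lt = {
    ("s1", "NewYork"), ("s1", "r1"),
    ("s2", "NewYork"), ("s2", "r2"),
    ("s3", "Buffalo"), ("Buffalo", "d1"), ("d1", "r1"),
}

city_district = finer(members, lt, base, "City", "SaleDistrict")
district_region = finer(members, lt, base, "SaleDistrict", "SaleRegion")
city_region = finer(members, lt, base, "City", "SaleRegion")
```

Here the first two checks succeed and the third fails, because NewYork overlaps both r1 and r2 without rolling up to either.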



Definition 7 (Hierarchical Dimension Schemas). A dimension schema $ds$
is hierarchical if for every pair of levels $l_1, l_2 \in L_{ds}$ such that $l_1 \nearrow l_2$ we have
that every dimension instance $d \in I(ds)$ satisfies $l_1 \sqsubseteq l_2$. A dimension schema
$ds$ is strictly hierarchical if for every pair of levels $l_1, l_2 \in L_{ds}$ such that $l_1 \nearrow^* l_2$
we have that every dimension instance $d \in I(ds)$ satisfies $l_1 \sqsubseteq l_2$.



[Figure 2, panels (A) and (B); image not reproduced. Level names legible in
panel (A): All, Country.]


Fig. 2. (A) A hierarchy schema of a Location dimension; (B) A dimension instance for 
(A). 



The relationship among the classes of schemas is given by the following chain
of inclusions: Strictly Homogeneous $\subset$ Homogeneous $\subset$ Strictly Hierarchical
$\subset$ Hierarchical.

2.4 Summarizability 

In this section we extend the notion of summarizability to heterogeneous
dimensions. A level represents a granularity of aggregation. In this sense, a level $l$ can
be associated with a one-dimensional cube view with no select conditions, that
has the following form: $\Pi_{l,\,af(m)}(\Gamma_{l_{base}}^{l}\,d \bowtie F)$, where $d$ is a dimension; $F$ is a
relation, called a fact table, with a special attribute $m$ called the measure, and
with $l_{base}$ included in its attributes; $af$ is an aggregate function; and $l$ is a level.
This cube view will be abbreviated as $cv(d, F, l, af(m))$.

Summarizability of levels in a dimension is related to the correct derivation of
one-dimensional cube views from other one-dimensional cube views using rollup
relations for grouping. Such derivations involve queries of the form:
$\Pi_{l,\,af^c(m)}\big(\biguplus_{i \in 1 \ldots n} \Pi_{l,m}(\Gamma_{l_i}^{l}\,d \bowtie F_i)\big)$. Intuitively, we are aggregating a set of fact
tables $F_1, \ldots, F_n$, using the rollup mappings $\Gamma_{l_1}^{l}, \ldots, \Gamma_{l_n}^{l}$.

Definition 8 (Summarizability). Given a dimension instance $d$, a set of
levels $L = \{l_1, \ldots, l_n\}$, and a level $l$, $l$ is summarizable from $L$ in $d$ if for every
fact table $F$ and distributive aggregate function $af$, we have:
$cv(d, F, l, af(m)) = \Pi_{l,\,af^c(m)}\big(\biguplus_{i \in 1 \ldots n} \Pi_{l,m}(\Gamma_{l_i}^{l}\,d \bowtie cv(d, F, l_i, af(m)))\big)$.
Given a dimension schema
$ds$, a set of levels $L = \{l_1, \ldots, l_n\}$, and a level $l$, $l$ is summarizable from $L$ in $ds$
if $l$ is summarizable from $L$ in every instance $d$ in $I(ds)$.



Example 5. In the dimension instance of Figure 2, we have that Country is
summarizable from {SaleDistrict, State}, but it is not summarizable from
{SaleRegion, State}.



The following lemma gives an alternative definition of summarizability. 



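Lemma 1 (Summarizability). A level $l$ is summarizable from a set of levels
$L = \{l_1, \ldots, l_n\}$ in a dimension instance $d$ iff
$\Gamma_{l_{base}}^{l} = \biguplus_{i \in 1 \ldots n} \pi_{l_{base},\,l}(\Gamma_{l_{base}}^{l_i} \bowtie \Gamma_{l_i}^{l})$.

Recall that we are assuming that the dimension is partitioned. From Lemma
1 we have that a level $l$ is summarizable from a single level $l_1$ in a dimension $d$
iff $\Gamma_{l_{base}}^{l} = \pi_{l_{base},\,l}(\Gamma_{l_{base}}^{l_1} \bowtie \Gamma_{l_1}^{l})$.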
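
The condition of Lemma 1 compares a relation with an additive union of joins, so a direct way to test it on an instance is to count tuples. The sketch below assumes the base rollup and rollup relations are already available as sets of pairs; the instance, the dictionary layout and the name `summarizable` are invented for illustration.

```python
from collections import Counter


def summarizable(base_rollup, rollup, level, levels):
    """Compare the base rollup to `level` with the additive union, over the
    levels l_i in `levels`, of (base rollup to l_i) joined with (rollup l_i to level)."""
    lhs = Counter(base_rollup[level])
    rhs = Counter()
    for li in levels:
        for b, e_i in base_rollup[li]:
            for e_j, e in rollup[(li, level)]:
                if e_i == e_j:
                    rhs[(b, e)] += 1  # additive union keeps multiplicities
    return lhs == rhs


# Hypothetical instance: s3 is a store in Washington, a city without a state.
base_rollup = {
    "City": {("s1", "Toronto"), ("s2", "NewYork"), ("s3", "Washington")},
    "Province": {("s1", "Ontario")},
    "State": {("s2", "NYstate")},
    "Country": {("s1", "Canada"), ("s2", "USA"), ("s3", "USA")},
}
rollup = {
    ("City", "Country"): {("Toronto", "Canada"), ("NewYork", "USA"),
                          ("Washington", "USA")},
    ("Province", "Country"): {("Ontario", "Canada")},
    ("State", "Country"): {("NYstate", "USA")},
}

from_city = summarizable(base_rollup, rollup, "Country", ["City"])
from_prov_state = summarizable(base_rollup, rollup, "Country", ["Province", "State"])
from_city_state = summarizable(base_rollup, rollup, "Country", ["City", "State"])
```

In this instance Country is summarizable from {City} only: using {Province, State} misses store s3, and using {City, State} counts store s2 twice.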






3 Split Constraints 

In this section we propose a class of constraints we call split constraints. The
intuition behind them is that usually there are dependencies between the
rollup functions that start from a common level. For example, we could say that
the cities that roll up to states do not roll up to any province in the location
dimension of Example 1.

To start, a split expression $\alpha(x)$ for level $l$ is a propositional formula (with
the usual connectives $\neg$, $\land$, $\lor$, $\supset$, $\Leftrightarrow$, and $\oplus$ for exclusive disjunction) over atoms
of the form $\exists x_i : \Gamma_{l}^{l_i}(x, x_i)$, or $\exists x_i : d\Gamma_{l}^{l_i}(x, x_i)$. Note that all the atoms represent
rollup from a common level $l$, and the only free variable is $x$. We use $\bot$ and $\top$
to denote the false and the true proposition respectively.

Definition 9 (Split Constraints). A split constraint for level $l$ is an
expression $\forall x \in \mathcal{E}_l : \alpha(x)$, where $\alpha(x)$ is a split expression for level $l$. A dimension
instance $d$ satisfies a split constraint $s$, denoted $d \models s$, if $s$ is true when each
atom $\Gamma_{l}^{l_i}(x, x_i)$ and $d\Gamma_{l}^{l_i}(x, x_i)$ is interpreted respectively by the indirect and direct
rollup relations of $d$. A base split constraint is a split constraint whose atoms all
roll up from $l_{base}$.
Example 6. The split constraint $\exists x_1 : \Gamma_{l}^{l_1}(x, x_1) \supset \exists x_2 : \Gamma_{l}^{l_2}(x, x_2)$ says that if
a member $e$ in $l$ rolls up to a member $e_1$ in $l_1$, then $e$ rolls up to a member in $l_2$.
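
A split constraint over indirect rollup atoms can be evaluated on an instance by checking, for each member of the level, which other levels it reaches. The sketch below is a minimal illustration: the instance is invented, `satisfies` is a hypothetical helper, direct rollup atoms are left out, and a split expression is written as a Python function that receives an atom lookup.

```python
def ancestors(lt, x):
    """All members reachable from x through the child/parent pairs in lt."""
    seen, stack = set(), [x]
    while stack:
        cur = stack.pop()
        for child, parent in lt:
            if child == cur and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def satisfies(members, lt, level, expr):
    """Check a split constraint: expr must hold for every member of `level`.
    `expr` receives a function that tells whether the member reaches some
    member of a given level (an indirect rollup atom)."""
    for x in members[level]:
        up = ancestors(lt, x)

        def atom(target, up=up):
            return bool(up & members[target])

        if not expr(atom):
            return False
    return True


# Hypothetical location instance.
members = {
    "City": {"Toronto", "NewYork", "Washington"},
    "State": {"NYstate"},
    "Province": {"Ontario"},
    "Country": {"Canada", "USA"},
}
lt = {
    ("Toronto", "Ontario"), ("NewYork", "NYstate"), ("Washington", "USA"),
    ("Ontario", "Canada"), ("NYstate", "USA"),
}

# Cities that roll up to a state do not roll up to any province.
state_excludes_province = satisfies(
    members, lt, "City", lambda g: (not g("State")) or (not g("Province")))

# Shape of Example 6: rolling up to a state implies rolling up to a country.
state_implies_country = satisfies(
    members, lt, "City", lambda g: (not g("State")) or g("Country"))

# Every city rolls up to a state or a province (fails for Washington).
state_or_province = satisfies(
    members, lt, "City", lambda g: g("State") or g("Province"))
```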

Note that, if we associate the variables of the atoms with the level that they
bind to, abstracting away the arguments and the existential quantifiers in the
atoms, we can write split constraints in an abbreviated form. For example, the
split constraint in the last example can be written as $\Gamma_{l}^{l_1} \supset \Gamma_{l}^{l_2}$.

Example 7. We can impose the following set of split constraints over the skeleton 
of Figure 2 (A). = {(a) (&) (c) 

/ 7\ j~\Country / \ -pCountry\ 
yd) I galeRegion'^ v^/ ^ State J ’ 

The following example depicts the use of direct rollup atoms in a split con- 
straint. 

Example 8. The split constraint $d\Gamma_{City}^{State} \supset \neg\Gamma_{City}^{County}$ says that if a city rolls up
directly to a state it does not roll up to any county.

We can characterize, using base split constraints, structural anomalies that 
can be present in a schema. A level could be empty, i.e., I could be constrained to 
not have any members in any instances of the schema; formally, this is captured 
by -'El . Even if we have a pair of levels h,l 2 such that li I 2 , the rollup 
function between them could be empty in all the instances of the schema; in this 
case we have ->eI^ V -iE!^ . Finally, we can have a dimension schema that 

is unsatisfiable; this can be characterized as for every I G Itase- As an 

example, if we imposed to the schema of Example 7 SaleDistrict would 

be an empty level. 

4 Inferring Summarizability 

In this section we give an algorithm for testing summarizability in the dimension
instances defined by a given hierarchical dimension schema, where the constraint
language consists of split constraints. Therefore, unless otherwise stated, we
assume that the dimension schemas mentioned are hierarchical and $\mathcal{CL}$ is the
class of split constraints.



4.1 Conditions for Summarizability 



In this section, we show that the problem reduces to testing inference of base
split constraints. Given a dimension schema $ds$, and two of its levels $l_i, l_n$, we
define the following split expression:



^h,ln =def 



. T otherwise. 

Note that if $l_i \sqsubseteq l_n$ then $\Delta_{l_i,l_n}$ is equivalent to
$\Gamma_{l_{base}}^{l_i} \land \Gamma_{l_{base}}^{l_n}$, and if there is no
path from $l_i$ to $l_n$ it is equivalent to $\bot$. The intuition behind the above expression
is that if we have $\Delta_{l_i,l_n}$ then every base member $e_b$ reaches an element $e_n$ in
$l_n$ passing through some element $e_i$ in $l_i$. For instance, if we state $\Delta_{l_1,l} \oplus \Delta_{l_2,l}$
then every base element reaches an element in $l$ passing through $l_1$, or through $l_2$,
but not through both $l_1$ and $l_2$. Note that the above statements do not always hold
for non-hierarchical dimension schemas.

Now, we give a lemma that characterizes summarizability in terms of a base
split constraint.

Lemma 2. A level $l$ is summarizable from a set of levels $L = \{l_1, \ldots, l_n\}$ in a
hierarchical dimension $d$ iff $d \models \Gamma_{l_{base}}^{l} \supset (\Delta_{l_1,l} \oplus \cdots \oplus \Delta_{l_n,l})$.

The intuition behind Lemma 2 is that in order for $l$ to be summarizable from
$L$, we need that every base member $e_b$ that reaches a member $e$ in $l$ reaches $e$
passing through one and only one of the levels in $L$.
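
The intuition behind Lemma 2 can be checked directly on an instance by counting, for each base member and each member it reaches in the target level, how many of the candidate levels lie on the way. The sketch below implements that informal reading only, not the split expression itself; the instance and the name `through_exactly_one` are hypothetical.

```python
def ancestors(lt, x):
    """All members reachable from x through the child/parent pairs in lt."""
    seen, stack = set(), [x]
    while stack:
        cur = stack.pop()
        for child, parent in lt:
            if child == cur and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def through_exactly_one(members, lt, base, level, levels):
    """True iff every base member that reaches a member e of `level`
    reaches e through exactly one of the levels in `levels`."""
    for b in base:
        up = ancestors(lt, b)
        for e in up & members[level]:
            passed = sum(
                any(e in ancestors(lt, e_i) for e_i in up & members[li])
                for li in levels)
            if passed != 1:
                return False
    return True


# Hypothetical instance: s3 is a store in Washington, a city without a state.
base = {"s1", "s2", "s3"}
members = {
    "City": {"Toronto", "NewYork", "Washington"},
    "Province": {"Ontario"},
    "State": {"NYstate"},
    "Country": {"Canada", "USA"},
}
lt = {
    ("s1", "Toronto"), ("s2", "NewYork"), ("s3", "Washington"),
    ("Toronto", "Ontario"), ("NewYork", "NYstate"), ("Washington", "USA"),
    ("Ontario", "Canada"), ("NYstate", "USA"),
}

via_city = through_exactly_one(members, lt, base, "Country", ["City"])
via_prov_state = through_exactly_one(members, lt, base, "Country", ["Province", "State"])
via_city_state = through_exactly_one(members, lt, base, "Country", ["City", "State"])
```

Store s3 reaches USA through neither Province nor State, and store s2 reaches USA through both City and State, so only the first check succeeds.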

Example 9. Consider the Location dimension depicted in Figure 2. Assume that
it is hierarchical, and the base level is denoted by Store. We have that Country
is summarizable from {SaleDistrict, State} iff
$Location \models \Gamma_{StoreId}^{Country} \supset \big((\Gamma_{StoreId}^{SaleDistrict} \land \Gamma_{StoreId}^{SaleRegion} \land \Gamma_{StoreId}^{Country}) \oplus (\Gamma_{StoreId}^{State} \land \Gamma_{StoreId}^{Country})\big)$.
The above split expression can be simplified to:
$\Gamma_{StoreId}^{Country} \supset \big((\Gamma_{StoreId}^{SaleDistrict} \land \Gamma_{StoreId}^{SaleRegion}) \oplus \Gamma_{StoreId}^{State}\big)$.

4.2 Derivation of Base Split Constraints 

In the previous section, we found necessary and sufficient conditions for
summarizability in terms of base split constraints. In this section, we give a sound
and complete set of inference rules for implication of base split constraints. The
notion of implication is stated as follows: given a dimension schema $ds$ and a split
constraint $\alpha$, we say that $ds \models \alpha$ if for every dimension instance $d \in I(ds)$ we
have $d \models \alpha$. We denote by $\Sigma_B$ the set of base split constraints $\{\beta \mid ds \models \beta\}$.

The first two derivation rules we introduce reflect the assumptions (a) and
(b) in Section 2.2, respectively.

Rule 1 For every level I G L \ LBottom we have ^ V{Zi | it/'i} 

Rule 2 We have e!' 

Rule 1 states that there are no leaves in the internal levels, and Rule 2 says 
that every base member reaches one and only one member in a bottom level. 
The next rule reflects the condition (3) of Definition 2. 

Rule 3 for every level I G L such that I yf All, we have E^^^ . 

Next, we show how to transform split constraints into equivalent base split
constraints. We will do this by expressing atoms in terms of base rollup atoms.
We need the following definition: given a dimension schema $ds$ and two of its
levels $l_a$ and $l_b$ we have:
rh A rl ‘

^base ^base 

_L 



-lib 



—def 

^ha.e ^ ^ha.e ^ V/„ . . ,J„ /, ,,, A ... A If {Ub} C 

iiTi^M={lak} 
otherwise 

For instance, the base split constraint $d\Delta_{l_a,l_b}$ states that every base element
reaches an element $e_b$ in $l_b$, passing through an element $e_a$ in $l_a$ that reaches
$e_b$ directly.

Now, we give a characterization of the relationship between split and base 
split constraints. 



Lemma 3. Given a dimension schema $ds$ and a split constraint $\alpha$ for a level $l$, for
every hierarchical dimension $d$ over $ds$ we have $d \models \alpha$ iff $d \models \Gamma_{l_{base}}^{l} \supset \alpha'$, where
$\alpha'$ is obtained from $\alpha$ by replacing every indirect rollup atom $\Gamma_{l}^{l_i}$ with $\Delta_{l,l_i}$, and
replacing every direct rollup atom $d\Gamma_{l}^{l_i}$ with $d\Delta_{l,l_i}$.

The following rule, which transforms split constraints to base split constraints,
comes from Lemma 3.

Rule 4 Given a split constraint $\alpha$ for a level $l$, we have $\Gamma_{l_{base}}^{l} \supset \alpha'$, where $\alpha'$ is obtained
from $\alpha$ by replacing every indirect rollup atom $\Gamma_{l}^{l_i}$ with $\Delta_{l,l_i}$, and replacing every
direct rollup atom $d\Gamma_{l}^{l_i}$ with $d\Delta_{l,l_i}$.

The last rule we give shows that base split constraints can be derived from
other base split constraints using propositional logic derivation. Let $\models_{prop}$ be the
propositional implication between base split constraints when considering rollup
atoms as propositional variables.

Rule 5 If we have the split constraints $\alpha_1, \ldots, \alpha_n$ and $\alpha_1, \ldots, \alpha_n \models_{prop} \beta$, then
we have $\beta$.

Finally, we show that the given axiomatization is sound and complete for
deriving $\Sigma_B$.

Theorem 1 (Soundness and Completeness of Base Split Inference).
Rules 1, 2, 3, 4, and 5 are sound and complete for $\Sigma_B$.



Example 10. Consider the base split constraint (*) that we have to test in
Example 9 in order to decide whether Country is summarizable from
{SaleDistrict, State}. Now, we give a derivation for (*). The formulas of the
individual steps, built from base rollup atoms from StoreId to the levels City,
SaleDistrict, State, SaleRegion, and Country, are illegible in the source; the
justification of each step is as follows.

(1) Rule 4 from split constraint (b)
(2) Rule 5 from (1) and split constraint (a)
(3) Rule 5 from split constraint (c)
(4) Rule 5 from (2) and (3)
(5) sequence of applications of Rules 1 and 5
(*) Rule 5 from (4) and (5)

4.3 Complexity

The following theorem gives a lower bound for the problem of inferring summa- 
rizability from a given dimension schema. 

Theorem 2. Given a dimension schema ds, a level l and a set of levels L,
deciding whether l is summarizable from L in every dimension $d \in I(ds)$ is
coNP-hard.

Proof. (Sketch) Transformation from VALIDITY. 

An upper bound of 0(n^2"), where n is the number of levels of the hierarchy 
schema, can be obtained by the algorithm in Figure 3. The upper bound is 
basically caused by the steps (2) and (3) of the algorithm, which are in 0(n^2”). 



Input: A heterogeneous dimension schema ds, a level $l \in L_{ds}$, and a set $L \subseteq L_{ds}$
Output: Whether l is summarizable from L in all the instances of ds

(1) Apply Rules 1 and 2, and 3, having Si 

(2) Apply Rule 4 to every split constraint in Si and S, giving Sbs- 

(3) For every level h G L compute 

(4) If Sbs \=prop D 0 ... 0 then return true; else return false 

Input: A homogeneous dimension schema ds, a level $l \in L_{ds}$, and a set $L \subseteq L_{ds}$
Output: Whether l is summarizable from L in all the instances of ds

(1) Let $L' = L \cap \{l' \mid l' \nearrow^{*} l\}$

(2) If every bottom level that reaches l reaches exactly one level in L' then
return true; else return false.



Fig. 3. Algorithms for testing summarizability: (Above) in a hierarchical dimension 
schema with split constraints, (Below) in a Homogeneous Dimension Schema without 
empty levels. 



It is easy to show that testing summarizability in strictly homogeneous schemas
is in polytime; we only have to check whether there is exactly one level
in L that reaches l in the dimension hierarchy. The following lemma shows that
testing summarizability in homogeneous schemas without empty levels is also in
polytime.

Lemma 4. Testing summarizability in a homogeneous schema without empty
levels is in polytime.

Proof. (Sketch) A polytime algorithm is given in Figure 3. 
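
The second algorithm of Figure 3 can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the hierarchy schema is given
as a dictionary mapping each level to its direct parent levels, "reaches" is
taken to be reflexive and transitive, and the names `reachable_from`,
`summarizable_homogeneous`, and the small `HIERARCHY` example are hypothetical
and not part of the paper.

```python
from collections import deque


def reachable_from(level, parents):
    """Return every level reachable from `level` via child -> parent edges,
    including `level` itself (reachability is assumed reflexive here)."""
    seen = {level}
    queue = deque([level])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def summarizable_homogeneous(parents, target, sources):
    """Test of Figure 3 (below) for a homogeneous schema without empty levels."""
    has_child = {p for ps in parents.values() for p in ps}
    levels = set(parents) | has_child
    bottoms = levels - has_child  # levels that no other level rolls up to
    # Step (1): keep only the source levels that reach the target level.
    l_prime = {l for l in sources if target in reachable_from(l, parents)}
    # Step (2): every bottom level reaching the target must reach
    # exactly one level of l_prime.
    for bottom in bottoms:
        reach = reachable_from(bottom, parents)
        if target in reach and len(reach & l_prime) != 1:
            return False
    return True


# Hypothetical schema, loosely echoing the level names used in the examples.
HIERARCHY = {
    "StoreId": ["City", "SaleDistrict"],
    "City": ["State"],
    "State": ["Country"],
    "SaleDistrict": ["SaleRegion"],
    "SaleRegion": ["Country"],
    "Country": ["All"],
}

print(summarizable_homogeneous(HIERARCHY, "Country", {"State"}))
print(summarizable_homogeneous(HIERARCHY, "Country", {"SaleDistrict", "State"}))
```

Each call performs one breadth-first traversal per bottom level and per source
level, so the running time is polynomial in the number of levels, in line with
Lemma 4.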



5 Related Work 

Cabibbo and Torlone [CT97] introduced one of the first formal models for OLAP
dimensions; it allows representing only strictly homogeneous dimensions. Further
models [LAW98], [HMV99a] have the same restriction. Jagadish et al. [JLS99]
introduce in their model the ability to represent non-strict homogeneous schemas,
referring to them as unbalanced. Although they motivate the need for modeling
heterogeneity, their model does not go beyond homogeneity. The model of Lehner
et al. [LAW98] considered a case of heterogeneity arising between levels and
attributes, which has a trivial treatment using null values.

The notion of summarizability was introduced by Rafanelli and Shoshani 
[RS90] as a property of statistical objects. Lenz and Shoshani [LS97] give condi- 
tions for summarizability of a level from a single level. Using a different, rather 
informal framework, they basically state that if (a) l± I 2 , (b) the rollup re- 
lations are functions, and (c) they are total between member sets, then l\ is 
summarizable from I 2 in a dimension d. They refer to condition (b) as the rollup 
relations being disjoint, and condition (c) as being complete. Assuming that the 
dimension is partitioned (or disjoint in their terminology), the above conditions 
are basically equivalent to the special case of Lemma 1 when summarizability is 
from a single level. 

Split constraints have the flavor of the disjunctive existential constraints
(dec's) introduced by Goldstein [Gol81], which basically specify where null values
may occur in a relation. Let us denote by Td the relational table that represents
the dimension instance d realized as a single table in the star schema. Intuitively,
Td is the relation that has as attributes the levels $L \cup \{l_{base}\}$, and is
defined by the outer-join of the base rollup relations. It is easy to show that the
partitioned property of a dimension d is equivalent to having $l_{base}$ as a key
for Td. Basically, a dec says that whenever a tuple is non-null for an attribute,
it must be non-null for all the attributes in at least one set of attributes in a
given list. Only a subset of bottom split constraints can be represented using
dec's over Td; split constraints are more general because they allow any
propositional condition on non-null attributes.



6 Conclusion 

The restriction of homogeneity in current dimension models leads to unnatural
dimension schemas in many real situations. The relaxation of this condition,
however, weakens the ability to reason about summarizability in the schema. We
identify a class of constraints that overcome this problem and enrich the
semantics of dimension schemas. We solve the problem of inferring summarizability
in a class of heterogeneous schemas. The problem is still open for
non-partitioned and heterogeneous schemas in general. The exploration of further
classes of constraints and their relation with summarizability is also a
challenging problem to pursue.

It is interesting to note that homogeneous dimension schemas convey data 
in exactly the same way as hierarchical data models. Furthermore, the notion of 
heterogeneity and the results of this paper extend to hierarchical data modeling 
in general. 

Summarizability is a particular case of the problem of using materialized 
views to compute aggregate views [SSJL96]. Little has been said about the in- 
terplay between constraints and aggregate view rewriting in the context of OLAP 
dimensions and hierarchies. Our work establishes the foundations for further re- 
search in this area. 







Acknowledgements. This research was supported by the National Science
and Engineering Research Council and the Institute for Robotics and Intelligent
Systems of Canada.

References 

[AGS+96] R. Agrawal, A. Gupta, S. Sarawagi, P. Deshpande, S. Agarwal,
J. Naughton, and R. Ramakrishnan. On the computation of multidimensional
aggregates. In Proceedings of the 22nd VLDB Conference, Bombay, India, 1996.

[CD97] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. In ACM SIGMOD Record 26(1), March 1997.

[CT97] L. Cabibbo and R. Torlone. Querying multidimensional databases. In
Proceedings of the 6th DBPL Workshop, Estes Park, Colorado, USA, 1997.

[GBLHP96] J. Gray, A. Bosworth, A. Layman, and H. H. Pirahesh. Data cube: A
relational operator generalizing group-by, cross-tab and sub-totals. In
Proceedings of the 12th IEEE-ICDE Conference, New Orleans, Louisiana, USA, 1996.

[Gol81] B. A. Goldstein. Constraints on null values in relational databases. In
Proceedings of the 7th VLDB Conference, Cannes, France, 1981.

[HMV99a] C. Hurtado, A. Mendelzon, and A. Vaisman. Maintaining data cubes
under dimension updates. In Proceedings of the 15th IEEE-ICDE Conference,
Sydney, Australia, 1999.

[HMV99b] C. Hurtado, A. Mendelzon, and A. Vaisman. Updating OLAP dimensions.
In Proceedings of the 2nd IEEE-DOLAP Workshop, Kansas City, Missouri, USA, 1999.

[HRU96] V. Harinarayan, A. Rajaraman, and J. Ullman. Implementing data cubes
efficiently. In Proceedings of the 1996 ACM-SIGMOD Conference, Montreal,
Canada, 1996.

[JLS99] H. V. Jagadish, L. V. S. Lakshmanan, and D. Srivastava. What can
hierarchies do for data warehouses? In Proc. of the 25th VLDB Conference,
Edinburgh, Scotland, UK, 1999.

[LAW98] W. Lehner, H. Albrecht, and H. Wedekind. Multidimensional normal
forms. In Proceedings of the 10th SSDBM Conference, Capri, Italy, 1998.

[LS97] H. J. Lenz and A. Shoshani. Summarizability in OLAP and statistical
databases. In Proceedings of the 9th SSDBM Conference, Olympia, Washington,
USA, 1997.

[Mal93] Francesco Malvestuto. A universal-scheme approach to statistical
databases containing homogeneous summary tables. In ACM Transactions on
Database Systems, Vol. 18, No. 4, December 1993.

[RS90] M. Rafanelli and A. Shoshani. Storm: A statistical object representation
model. In Proceedings of the 5th SSDBM Conference, Charlotte, N.C., USA, 1990.

[SSJL96] D. Srivastava, D. Shaul, H. V. Jagadish, and A. Levy. Answering
queries with aggregation using views. In Proceedings of the 22nd VLDB
Conference, Bombay, India, 1996.




Estimating Range Queries Using Aggregate 
Data with Integrity Constraints: A Probabilistic 

Approach 



Francesco Buccafurri¹, Filippo Furfaro², and Domenico Sacca³

¹ DIMET, University of Reggio Calabria, 89100 Reggio Calabria, Italy,
bucca@ing.unirc.it
² DEIS, University of Calabria, 87030 Rende, Italy, furfaro@si.deis.unical.it
³ ISI-CNR & DEIS, 87030 Rende, Italy, sacca@unical.it



Abstract. In fast OLAP applications it is often advantageous to provide
approximate answers to range queries in order to achieve very high
performance. A possible solution is to query summary data rather than
the original data and to perform suitable interpolations. Approximate
answers become mandatory in situations where only aggregate data are
available. This paper studies the problem of estimating range queries
(namely, sum and count) over aggregate data using a probabilistic approach
for computing the expected value and variance of the answers. The novelty of
this approach is the exploitation of possible integrity constraints about
the presence of elements in the range that are known to be null or
non-null. Closed formulas for all results are provided, and some interesting
applications for query estimations on histograms are discussed.



1 Introduction 

Traditional query processing deals with computing exact answers while possibly
minimizing response time and maximizing throughput. However, a recent querying
paradigm, on-line analytical processing (OLAP) [11,13,5], often involves complex
range queries over very large datacubes (i.e., multidimensional relations with
dimension and measure attributes) so that the exact answer may require a huge
amount of time and resources. As OLAP queries mainly deal with operations of
aggregation (e.g., count and sum) of the measure values on dimension ranges, an
interesting approach to improve performance is to store some aggregate data
and to query them rather than the original data, thus obtaining approximate
answers. This approach is very useful when the user wants fast answers
without being forced to wait a long time for a precision that is often not
necessary.

The possibility of returning approximate answers for range queries has been 
first explicitly addressed in [12] but, in that case, the approximation is temporary 
since results are output on the fly while the tuples are being scanned and, at 
the end, after all original tuples are consulted, the user will eventually get the 
correct answer. 
The issue of computing approximate range query answers by never accessing
original tuples but only consulting aggregate data has very recently started
receiving a great deal of attention. Typical approaches consist in re-using
statistical techniques which have been applied for many years inside query
optimizers for selectivity estimation [17]. We recall that three major classes of
statistical techniques are used for selectivity estimation: sampling, histograms
and parametric modeling (see [2] for a detailed survey). Interesting applications
of sampling and histogram techniques already exist, see for instance [8,9] and
[10], respectively. Sampling techniques are also used for approximate join-query
answering [1]. A recent technique for selectivity estimation, wavelet-based
histograms, has already been applied to approximate answering of range queries
[18].

In this paper we propose a probabilistic approach to compute approximate 
answers to range queries (in particular, count and sum queries) by consulting a 
compressed representation of the datacube, that is a partition of the datacube 
into blocks of possibly different sizes storing a number of aggregate data (number 
of non-null tuples and sum of their measure values) for each block. Our approxi- 
mated results will come with a detailed analysis of the possible error so that, if 
the user is not satisfied with the obtained precision, s/he may eventually decide 
to submit the query on the actual datacube. In this case, it is not necessary to 
run the query over all tuples but only on those portions of the range that do not 
fit the blocks. 

Our approach is not concerned with the problem of finding the most effective
compressed representation of a datacube to increase accuracy in query
estimation; this is instead the main goal of the sampling and histogram techniques
mentioned above. We are concerned with the "apparently" simpler problem of
performing interpolation of aggregate data once the compressed representation
for the datacube has been decided. This means that our approach can also be
used to interpolate data from summarized ones for which detail tuples are not
available. This case was first studied in [6], and interesting results have been
obtained by enforcing the optimization of some criterion like the smoothness of
the distribution of values. Our approach does not make any assumption on data
distribution and performs estimations extending the probabilistic framework
introduced in [3].

The novelty of our approach is that we exploit additional information on a
datacube that is often available in the form of integrity constraints. In
particular, we assume the existence of constraints stating that a minimum number
of null or non-null tuples are present in given ranges. Such a situation often
arises in practice. For instance, given a datacube whose dimensions are the time
(in terms of days) and the products while the measure is the amount of daily
product sales, realistic integrity constraints are that the sales are null during the
week-end while at least 4 times a week the sales are not null.

In the paper we analyze two types of integrity constraints:

– number of elements that are known to be null: we are given a function
$LB_{=0}$ returning, for any range R, a lower bound to the number of null tuples
occurring in R, so $|R| - LB_{=0}(R)$ is an upper bound on the number of
non-nulls occurring in the range;

– number of elements that are known to be non-null: we are given a function
$LB_{>0}$ returning, for any range R, a lower bound for the number of non-null
tuples occurring in R.

The two functions $LB_{=0}$ and $LB_{>0}$ are assumed to be monotone, to require
a small amount of additional storage space and to be computable very efficiently,
actually in constant time w.r.t. the size of the compressed datacube. Possible
future research directions could explore other types of integrity constraints: then
the problem of interpolating data from compressed representation could
eventually enter the field of knowledge discovery and data mining. This explains
why we have stressed above that the problem is only apparently simple.

Our problem is therefore the following: given a range R inside a datacube
block B for which we know the count t (the number of non-null tuples) and
the sum s of their measure values, we want to compute the estimation (mean
and variance) of the count $t_R$ and the sum $s_R$ for the range R, knowing that
$|R| - LB_{=0}(R) \ge t_R \ge LB_{>0}(R)$ and
$|\tilde{R}| - LB_{=0}(\tilde{R}) \ge t_{\tilde{R}} \ge LB_{>0}(\tilde{R})$,
where $\tilde{R}$ is the range in B complementary to R.

The results we provide are formulas for mean and variance of both count and
sum queries; moreover, the formulas are closed, so they can be computed very
efficiently. For instance, suppose that the block B has size 120, the number of
non-nulls in it is 80 and their sum is 12000 and that the range R consists of
the first 30 tuples in the block. Without integrity constraints we have that the
expected value for $s_R$ is obviously (30/120) × 12000 = 3000; note that the
knowledge about the number of non-nulls does not contribute to the estimation.
Suppose now that we know that at least 4 of the first 20 tuples in the block
are null but at least 2 of them are not null; moreover, at least half of the last
20 tuples are not null. Thus, $LB_{=0}(R) = 4$, $LB_{>0}(R) = 2$,
$LB_{=0}(\tilde{R}) = 0$ and $LB_{>0}(\tilde{R}) = 10$. By applying our formulas
we now obtain that the expected value for $s_R$ is 4100.

Estimating mean values is not enough in most situations: we also need to 
compute the possible error in the estimation. For instance, given a block of size 
100 and sum 10000 and given a range R coinciding with half of the block, the 
expected value for the sum in R is independent from the number of non-nulls in 
B. But it is obvious that the error in the case this number is 2 is much higher 
than for the case with, say, 90 non-nulls; so we need to consult the variance of 
the estimation before concluding that it is meaningful. Our results include closed 
formulas also for the variance of both count and sum queries. Indeed the proofs 
of such formulas are rather long and complex so we have included only one proof 
in the appendix. Besides, for reason of space, all the other proofs are either left 
out or only sketched. 

The paper is organized as follows. In Section 2 we introduce the compressed
representation of a datacube and the integrity constraints about the number of
null or non-null tuples in the datacube ranges. In Section 3 we fix the
probabilistic framework for estimating count and sum range queries on a datacube
M by means of random query variables over the population of all datacubes
which both have the same aggregate data as M and satisfy the integrity
constraints. For the sake of presentation, in Section 4 we perform range query
estimation for the simple case that only integrity constraints about the null
tuples are available (i.e., only the function $LB_{=0}$ is given); the general case
is treated in the subsequent section. Finally, in Section 6 we give some interesting
applications of our formulas for the estimation of frequency distribution inside a
bucket of a histogram [14,15,16]. The most common approach is the continuous
value assumption [17]: the sum of frequencies in a range of a bucket is
estimated by linear interpolation. We shall show that this computation does not
yield a correct estimation for the case of a bucket whose extremes (or at least one
of them) are known to be not null. This situation, which arises for many of the
most popular histogram representations, can be formalized in terms of our
integrity constraints, thus obtaining more accurate estimations as well as the
evaluation of their errors.



2 Compressed Datacubes and Integrity Constraints 

Let $i = \langle i_1, \ldots, i_r \rangle$ and $j = \langle j_1, \ldots, j_r \rangle$
be two r-tuples of cardinals, with r > 0. We extend common operators for cardinals
to tuples in the obvious way: $i \le j$ means that $i_1 \le j_1, \ldots, i_r \le j_r$;
$i + j$ denotes the tuple $\langle i_1 + j_1, \ldots, i_r + j_r \rangle$ and so on.
Given p > 0, $p^r$ (or simply p, if r is understood) denotes the r-tuple
of all p. Finally, $[i..j] = [i_1..j_1, \ldots, i_r..j_r]$ denotes the range of all
tuples from i to j, that is $\{q \mid i \le q \le j\}$.

A multidimensional relation R is a relation whose scheme consists of r > 0
dimensions (also called functional attributes) and s > 0 measure attributes.
The dimensions are a key for the relation so that no two tuples have the same
dimension value. For the sake of presentation but without loss of generality, we
assume that

– s = 1 and the domain of the unique measure attribute is the set of cardinals,
and

– $r \ge 1$ and the domain of each dimension q, $1 \le q \le r$, is the range
$[1..n_q]$, where $n_q \ge 2$, i.e., the projection of R on the dimensions is a
subset of [1..n], where $n = \langle n_1, \ldots, n_r \rangle$.

Given any range [i..j], $1 \le i \le j \le n$, we consider the following range
queries on R:

– count query: $count^{[i..j]}(R)$ denotes the number of tuples of R whose
dimension values are in [i..j], and

– sum query: $sum^{[i..j]}(R)$ denotes the sum of all measure values for those
tuples of R whose dimension values are in [i..j].

Since the dimension attributes are a key, the relation R can be naturally viewed
as a [1..n] matrix (i.e., a datacube) M of elements with values in $\mathcal{N}$
such that for each $i \in [1..n]$, M[i] = v if the tuple <i, v> is in R or otherwise
M[i] = 0; then i is a null element if either <i, 0> is in R or no tuple with
dimension value i is present in R. The above range queries can be now
re-formulated in terms of array operations as follows:

– $count^{[i..j]}(R) = count(M[i..j]) = |\{q \mid q \in [i..j] \text{ and } M[q] > 0\}|$;

– $sum^{[i..j]}(R) = sum(M[i..j]) = \sum_{q \in [i..j]} M[q]$.
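
The array formulation of the two range queries can be illustrated with a short
Python sketch. It assumes, purely for illustration, that the datacube M is
stored sparsely as a dictionary from 1-based index tuples to measure values,
with absent entries read as 0; the function names and the sample data are
hypothetical.

```python
from itertools import product


def cells(i, j):
    """Enumerate all index tuples q with i <= q <= j, component-wise."""
    return product(*(range(lo, hi + 1) for lo, hi in zip(i, j)))


def count_query(M, i, j):
    """Number of non-null elements of M in the range [i..j]."""
    return sum(1 for q in cells(i, j) if M.get(q, 0) > 0)


def sum_query(M, i, j):
    """Sum of the elements of M in the range [i..j]."""
    return sum(M.get(q, 0) for q in cells(i, j))


# Hypothetical 2-dimensional datacube; (4, 6) is stored explicitly as null.
M = {(1, 1): 5, (2, 3): 7, (4, 2): 1, (4, 6): 0}

print(count_query(M, (1, 1), (4, 3)), sum_query(M, (1, 1), (4, 3)))
```

Note that an explicitly stored 0 and a missing entry are both treated as null
elements, matching the definition of null element given above.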

We next introduce a compressed representation of the relation R by dividing
the datacube M into a number of blocks and by storing a number of aggregate
data for each of them. To this end, given $m = \langle m_1, \ldots, m_r \rangle$ in
[1..n], an m-compression factor for M is a tuple
$F = \langle f_1, \ldots, f_r \rangle$, such that for each q, $1 \le q \le r$,
$f_q$ is a $[0..m_q]$ array for which
$0 = f_q[0] < f_q[1] < \cdots < f_q[m_q] = n_q$.
For each $k = \langle k_1, \ldots, k_r \rangle$ in [1..m], let $F^+(k)$ and
$F^-(k)$ denote the tuples $\langle f_1[k_1], \ldots, f_r[k_r] \rangle$ and
$\langle f_1[k_1 - 1] + 1, \ldots, f_r[k_r - 1] + 1 \rangle$, respectively.

Therefore, F divides the range [1..n] into $m_1 \times \cdots \times m_r$ blocks
$B_k$, one for each tuple $k = \langle k_1, \ldots, k_r \rangle$ in [1..m]; the
block $B_k$ has range $[F^-(k)..F^+(k)]$ and size
$(f_1[k_1] - f_1[k_1 - 1]) \times \cdots \times (f_r[k_r] - f_r[k_r - 1])$ if k > 1 or
$f_1[k_1] \times \cdots \times f_r[k_r]$ otherwise.

For instance, consider the [1..10,1..6] matrix M in Figure 1(a), which is
divided into 6 blocks as indicated by the double lines. We have that m = <3, 2>,
$f_1[0] = 0$, $f_1[1] = 3$, $f_1[2] = 7$, $f_1[3] = 10$, and $f_2[0] = 0$,
$f_2[1] = 4$, $f_2[2] = 6$. The block $B_{<1,1>}$ has size 3×4 and range
[1..3, 1..4]; the block $B_{<1,2>}$ has size 3×2 and range [1..3,5..6], and so on.
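
As a quick check of the example, the block ranges can be computed directly from
the compression factor. The sketch below assumes F is represented as a tuple of
Python lists, one per dimension, with index 0 holding the leading 0; the
function names are hypothetical.

```python
from itertools import product


def block_range(F, k):
    """Return (F^-(k), F^+(k)), the lower and upper corner of block B_k."""
    lower = tuple(f[kq - 1] + 1 for f, kq in zip(F, k))
    upper = tuple(f[kq] for f, kq in zip(F, k))
    return lower, upper


def block_size(F, k):
    """Number of elements of block B_k."""
    lower, upper = block_range(F, k)
    size = 1
    for lo, hi in zip(lower, upper):
        size *= hi - lo + 1
    return size


# Compression factor of the running example: f_1 = [0, 3, 7, 10], f_2 = [0, 4, 6].
F = ([0, 3, 7, 10], [0, 4, 6])
m = tuple(len(f) - 1 for f in F)  # m = (3, 2)

for k in product(*(range(1, mq + 1) for mq in m)):
    print(k, block_range(F, k), block_size(F, k))
```

The six block sizes add up to 60, the size of the [1..10,1..6] datacube, which
confirms that the blocks partition it.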
(a) [the matrix M is not recoverable from the source]

(b)  ([1..3,1..4], 8, 26)     ([1..3,5..6], 5, 29)
     ([4..7,1..4], 7, 18)     ([4..7,5..6], 7, …)
     ([8..10,1..4], 4, 4)     ([8..10,5..6], 6, …)

Fig. 1. A two-dimensional datacube and its compressed representation



A compressed representation of the datacube M consists of selecting an
m-compression factor F and storing the following aggregate data on the
F-compressed blocks of M:

– the [1..m] matrices $M_{count,F}$ and $M_{sum,F}$ such that for each
$k \in [1..m]$, $M_{cs,F}[k] = cs(M[F^-(k)..F^+(k)])$, where cs stands for
count or sum.

The compressed representation of the datacube M in Figure 1(a) is represented
in Figure 1(b) by a matrix of triples, one for each block; the values of each triple
indicate respectively the range, the number of non-null elements and the sum
of the elements in the corresponding block. For instance, the block $B_{<1,1>}$
has range [1..3, 1..4] and 8 non-null elements with sum 26; the block
$B_{<1,2>}$ has range [1..3,5..6] and 5 non-null elements with sum 29, and so on.

We assume that we are given a compressed representation of a datacube
M as well as additional information on M in the form of integrity constraints
on the content of M. The representation of such constraints needs a small amount
of additional storage space; besides, the requirements defined by the constraints
are expressed in terms of aggregate data which are computed by suitable
functions in constant time, so the functions do not depend on the actual
contents of M. As discussed in the Introduction, data distributions often match
this property in real contexts. For instance, consider the case of a temporal
dimension with granularity day and a measure attribute storing the amount of
sales for every day. In this case, given any temporal range, a number of certain
null values, corresponding to the holidays occurring in that range, is easily
recognizable. In such cases, the constraints provide additional information that
can be efficiently computed with no overhead in terms of storage space on the
compressed representation of M.

Let $2^{[1..n]}$ be the family of all subsets of indices in [1..n]. We analyze two
types of integrity constraints:

– number of elements that are known to be null: we are given a function
$LB_{=0} : 2^{[1..n]} \to \mathcal{N}$ returning, for any D in $2^{[1..n]}$, a lower
bound to the number of null elements occurring in D; the datacube M satisfies
$LB_{=0}$ if for each D in $2^{[1..n]}$,
$\sum_{i \in D} count(M[i]) \le |D| - LB_{=0}(D)$, where |D| is the number of
elements of M in D;

– number of elements that are known to be non-null: we are given a function
$LB_{>0} : 2^{[1..n]} \to \mathcal{N}$ returning, for any D in $2^{[1..n]}$, a lower
bound for the number of non-null elements occurring in D; the datacube M
satisfies $LB_{>0}$ if for each D in $2^{[1..n]}$,
$\sum_{i \in D} count(M[i]) \ge LB_{>0}(D)$.

The two functions $LB_{=0}$ and $LB_{>0}$ are monotone: for each D', D'' in
$2^{[1..n]}$, if $D' \subseteq D''$ then both $LB_{=0}(D') \le LB_{=0}(D'')$ and
$LB_{>0}(D') \le LB_{>0}(D'')$.

Suppose that $LB_{=0}([4..6, 1..3]) = 3$ and $LB_{>0}([4..6, 1..3]) = 1$ in our
running example. Then we infer that the number of non-null elements in the range
[4..6, 1..3] is between 1 and (6 − 4 + 1) × (3 − 1 + 1) − 3 = 6. Note that the
compressed representation of M in Figure 1(b) only says that the block [4..7, 1..4]
has 7 non-nulls; so, from this information, we only derive that the bounds on
the number of non-null elements in [4..6, 1..3] are 0 and 7.
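
The arithmetic of this example can be reproduced with a small Python sketch. The
helper names are hypothetical, and the values of the two lower-bound functions
are simply passed in as numbers, since the paper does not fix how they are
implemented.

```python
def range_size(i, j):
    """Number of elements in the range [i..j] (inclusive, component-wise)."""
    size = 1
    for lo, hi in zip(i, j):
        size *= hi - lo + 1
    return size


def non_null_bounds(i, j, lb_null=0, lb_non_null=0, block_count=None):
    """Lower and upper bound on the number of non-null elements in [i..j].

    lb_null and lb_non_null stand for the values of LB_{=0} and LB_{>0} on the
    range; block_count, if given, is the count stored for the enclosing block.
    """
    upper = range_size(i, j) - lb_null
    if block_count is not None:
        upper = min(upper, block_count)
    return lb_non_null, upper


# With the integrity constraints of the example: between 1 and 6 non-nulls.
print(non_null_bounds((4, 1), (6, 3), lb_null=3, lb_non_null=1))
# With only the block count (7 non-nulls in [4..7, 1..4]): between 0 and 7.
print(non_null_bounds((4, 1), (6, 3), block_count=7))
```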



3 The Probabilistic Framework for Range Query 
Estimation 

We next introduce a probabilistic framework for estimating the answers of range
queries (sum and count) by consulting aggregate data rather than the actual
datacube. To this aim, we consider the queries as random variables and we
give their estimation in terms of mean and variance. More precisely, a range
query Q on a given datacube M is estimated by a random query variable
$\tilde{Q}$, defined by applying Q on a datacube $\tilde{M}$ extracted from the
population of all datacubes whose compressed representation is 'compatible' with
the one of M. Thus, the estimation of the range query Q is only based on the
knowledge of the compressed representation of M. A crucial point in such
estimation is the definition of the population 'compatible' with the compressed
representation of the given datacube M.

We start from the population of the datacubes having the same aggregate
data that we assume available for M: $M^{-1}_{cs,F}$ is the set of all the [1..n]
matrices M' of elements in $\mathcal{N}$ for which both
$M'_{count,F} = M_{count,F}$ and $M'_{sum,F} = M_{sum,F}$.

We next restrict the population $M^{-1}_{cs,F}$ by considering only those
datacubes which satisfy a given set of integrity constraints on the number of
non-null elements.

Let us now define the random variables for the estimation of the count and 
the sum query. 

Let the queries count(M[i..j]) and sum(M[i..j]) be given and let $LB_{=0}$ and
$LB_{>0}$ be two integrity constraints that are satisfied by M. We shall estimate
the two queries with the two random query variables $count(\tilde{M}[i..j])$ and
$sum(\tilde{M}[i..j])$, respectively, in the following two cases:

1. for $\tilde{M}$ extracted from the population
$\sigma_{LB_{=0}}(M^{-1}_{cs,F}) = \{M' \mid M' \in M^{-1}_{cs,F} \text{ and } M' \text{ satisfies } LB_{=0}\}$;
thus we estimate the number and the sum of the non-null elements in M[i..j] by
considering the population of all datacubes having both the same sum and the
same number of non-nulls in each block as M and satisfying the lower bound
constraint enforced by the function $LB_{=0}$ on the number of null elements
occurring in each range;

2. for $\tilde{M}$ extracted from the population
$\sigma_{LB_{=0},LB_{>0}}(M^{-1}_{cs,F}) = \{M' \mid M' \in M^{-1}_{cs,F} \text{ and } M' \text{ satisfies } LB_{=0} \text{ and } LB_{>0}\}$;
thus we estimate the number and the sum of the non-null elements in M[i..j] by
restricting the population of the previous case to those datacubes which also
satisfy the lower bound constraint enforced by the function $LB_{>0}$ on the
number of non-null elements occurring in each range.

We observe that Case 1 can be derived from the more general Case 2 but, 
for the sake of presentation, we first present the simpler case and then we move 
to the general case. 

Once the datacube population for a random variable $query(\tilde{M}[i..j])$ (where
query stands for count or sum) is fixed, we have to determine its probability
distribution and then its mean and variance; recall that both mean and variance
are defined by the operator E. Concerning the mean, due to the linearity
of E we have:

$$E(query(\tilde{M}[i..j])) = \sum_{B_q \in TB_F(i..j)} M_{query,F}[q] + \sum_{B_k \in PB_F(i..j)} E(query(\tilde{M}[i_k..j_k]))$$



where

1. $TB_F(i..j)$ returns the set of blocks $B_q$ that are totally contained in the
range [i..j], i.e., both $i \le F^-(q)$ and $F^+(q) \le j$,

2. $PB_F(i..j)$ returns the set of blocks $B_k$ that are partially inside the range,
i.e., $B_k \notin TB_F(i..j)$ and either $i \le F^-(k) \le j$ or
$i \le F^+(k) \le j$, and

3. for each $B_k \in PB_F(i..j)$, $i_k$ and $j_k$ are the boundaries of the portion
of the block $B_k$ which overlaps the range [i..j], i.e.,
$[i_k..j_k] = [i..j] \cap [F^-(k)..F^+(k)]$.

For example, for the datacube in Figure 1(a), given i = <4, 3> and j =
<8, 6>, the block $B_{<2,2>}$ is totally contained in the range, the blocks
$B_{<2,1>}$, $B_{<3,1>}$, $B_{<3,2>}$ are partially contained in the range (with
boundaries [4..7, 3..4], [8..8,3..4] and [8..8,5..6], respectively), and the blocks
$B_{<1,1>}$, $B_{<1,2>}$ are outside the range.
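
The classification used in this example can be reproduced with the following
Python sketch. It is an illustration only: blocks are classified by intersecting
their range with the query range (an assumption that agrees with the example
above), the compression factor is stored as in the running example, and all
function names are hypothetical.

```python
from itertools import product


def block_range(F, k):
    """Return (F^-(k), F^+(k)), the lower and upper corner of block B_k."""
    lower = tuple(f[kq - 1] + 1 for f, kq in zip(F, k))
    upper = tuple(f[kq] for f, kq in zip(F, k))
    return lower, upper


def classify_blocks(F, i, j):
    """Split the blocks into those totally inside [i..j] and those partially
    inside it; for the latter, also return the overlap [i_k..j_k]."""
    total, partial = [], {}
    m = tuple(len(f) - 1 for f in F)
    for k in product(*(range(1, mq + 1) for mq in m)):
        lower, upper = block_range(F, k)
        # Component-wise intersection of the block range with [i..j].
        i_k = tuple(max(a, b) for a, b in zip(i, lower))
        j_k = tuple(min(a, b) for a, b in zip(j, upper))
        if any(a > b for a, b in zip(i_k, j_k)):
            continue  # empty intersection: the block is outside the range
        if (i_k, j_k) == (lower, upper):
            total.append(k)
        else:
            partial[k] = (i_k, j_k)
    return total, partial


F = ([0, 3, 7, 10], [0, 4, 6])
total, partial = classify_blocks(F, (4, 3), (8, 6))
print(total)
print(partial)
```

The mean of a range query is then the sum of the stored aggregates over
`total` plus the estimated contributions of the overlaps listed in `partial`.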

Concerning the variance, we assume statistical independence between the
measure values of different blocks so that its value is determined by summing
the variances of all partially overlapped blocks, thus introducing no covariance:

$$\sigma^2(query(\tilde{M}[i..j])) = \sum_{B_k \in PB_F(i..j)} \sigma^2(query(\tilde{M}[i_k..j_k]))$$



It turns out that we only need to study the estimation of a query ranging
on one partial block as all other cases can be easily re-composed from this basic
case. Therefore, from now on we assume that the query range [i..j] is strictly
inside one single block, say the block $B_k$, i.e.,
$F^-(k) \le i \le j \le F^+(k)$. We use the following notations and assumptions:

1. b, b > 1, is the size of $B_k$; thus b is the number of elements in $B_k$;

2. $b_{i..j}$, $1 \le b_{i..j} < b$, is the size of [i..j], that is the number of
elements in the range;

3. $t = M_{count,F}[k]$, $1 \le t \le b$, is the number of non-null elements in
$B_k$;

4. $s = M_{sum,F}[k]$, $s \ge \max(1, t)$, is the sum of the elements in $B_k$;

5. $t^u_{i..j} = b_{i..j} - LB_{=0}([i..j])$ and $t^l_{i..j} = LB_{>0}([i..j])$ are
respectively an upper bound and a lower bound on the number of non-null elements
in the range [i..j];

6. $t^u_{\widetilde{i..j}} = b_{\widetilde{i..j}} - LB_{=0}([\widetilde{i..j}])$ and
$t^l_{\widetilde{i..j}} = LB_{>0}([\widetilde{i..j}])$ are respectively an upper
bound and a lower bound on the number of non-null elements in the block
$B_k$ outside the range [i..j];

7. $t^u = t^u_{i..j} + t^u_{\widetilde{i..j}} = b - LB_{=0}([i..j]) - LB_{=0}([\widetilde{i..j}])$
and $t^l = t^l_{i..j} + t^l_{\widetilde{i..j}} = LB_{>0}([i..j]) + LB_{>0}([\widetilde{i..j}])$,
where $[\widetilde{i..j}]$ denotes the set of elements that are in $B_k$ but not in
the range [i..j]; $t^u$ and $t^l$ are an upper bound and a lower bound on the
number of non-null elements in $B_k$.

Observe that the functions $LB_{=0}$ and $LB_{>0}$ are computed for the ranges
[i..j] and $[\widetilde{i..j}]$ but not for the whole block $B_k$. Indeed, $t^l$ and
$t^u$ do not in general coincide with $LB_{>0}([F^-(k)..F^+(k)])$ and
$b - LB_{=0}([F^-(k)..F^+(k)])$, respectively, as the latter ones may be stricter
bounds. For instance, suppose that the block stores the bimonthly sales of a store
and we want to estimate the sales in the first month. The integrity constraints say
that the store closes 4 days every month and an additional day every two months.
So the resulting upper bound is 60 − 4 = 56 and not 55. Thus the additional day is
not taken into account but this does not affect at all the accuracy of the
estimation: indeed we have available the actual number of opened days for the
block of two months.



4 Case 1: Range Query Estimation Using Upper Bounds 
on the Number of Non-null Elements 

In this section we only consider upper bounds on the number of non-null elements
in the ranges $[i..j]$ and $[\widetilde{i..j}]$, which are derived by the function $LB_{=0}$. We define
the random variables $count(M[i..j])$ and $sum(M[i..j])$ by extracting $M$ from the
population $\sigma_{LB_{=0}}(M^{-1}_{cs,F})$ of all datacubes having both the same sum and the
same number of non-nulls in each block as $M$ and satisfying the upper bound
constraints on the number of non-null elements in each range. We assume that both $t^l_{i..j}$
and $t^l_{\widetilde{i..j}}$ are equal to zero, i.e., both $LB_{>0}([i..j]) = 0$ and $LB_{>0}([\widetilde{i..j}]) = 0$.

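Theorem 1. Let $C_1([i..j]) = count(M[i..j])$ and $S_1([i..j]) = sum(M[i..j])$ be
two integer random variables ranging from 0 to $t$ and from 0 to $s$, respectively,
defined by taking $M$ in the datacube population $\sigma_{LB_{=0}}(M^{-1}_{cs,F})$. If $t^l_{i..j} = t^l_{\widetilde{i..j}} = 0$
then for each $t_{i..j}$ and $s_{i..j}$, $0 \le t_{i..j} \le t^u_{i..j}$ and $0 \le s_{i..j} \le s$, the joint probability
distribution $P(C_1([i..j]) = t_{i..j}, S_1([i..j]) = s_{i..j})$ is equal to:

$$P(C_1([i..j]) = t_{i..j}, S_1([i..j]) = s_{i..j}) = \frac{Q(t^u_{i..j}, t_{i..j}, s_{i..j}) \cdot Q(t^u_{\widetilde{i..j}}, t_{\widetilde{i..j}}, s_{\widetilde{i..j}})}{Q(t^u, t, s)}$$

where $t_{\widetilde{i..j}} = t - t_{i..j}$, $s_{\widetilde{i..j}} = s - s_{i..j}$ and

$$Q(t^u, t, s) = \begin{cases} 0 & \text{if } (t = 0 \wedge s > 0) \vee (t > 0 \wedge s < t) \vee t > t^u, \\ 1 & \text{if } t = 0 \wedge s = 0, \\ \binom{t^u}{t} \cdot \binom{s-1}{s-t} & \text{otherwise.} \end{cases}$$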



Proof. (Sketch) The probability distribution does not change if we reduce the
size of the range and of the block by removing certain null elements. Therefore,
the size of the block $B_k$ is assumed to be $t^u$ and, then, the size of the query
range becomes $t^u_{i..j}$. We can now see the block as a vector, say $V$, of $t^u$ elements
with values in $[0..s]$ such that their total sum is $s$ and the number of non-null
elements is $t$. We divide $V$ into two subvectors $V'$ and $V''$ such that $V'$ consists
of the first $t^u_{i..j}$ elements and $V''$ of the last $t^u_{\widetilde{i..j}} = t^u - t^u_{i..j}$ ones. The probability
of the event $(C_1([i..j]) = t_{i..j} \wedge S_1([i..j]) = s_{i..j})$ is then equal to the probability
that $V'$ contains $t_{i..j}$ non-null elements whose sum is $s_{i..j}$. Let us denote this
probability by $P$. Observe that the event implies that $V''$ contains $t - t_{i..j}$ non-null
elements whose sum is $s - s_{i..j}$. It is then easy to see that $P$ is equal to

$$\frac{Q(t^u_{i..j}, t_{i..j}, s_{i..j}) \cdot Q(t^u_{\widetilde{i..j}}, t - t_{i..j}, s - s_{i..j})}{Q(t^u, t, s)}$$

where $Q(t^u, t, s)$ is the number of possible configurations for a vector of size
$t^u$ containing exactly $t$ non-null elements with total sum $s$.

$Q(t^u, t, s)$ can be determined by considering all possible ways of distributing
the sum $s$ into $t$ non-fixed positions by assigning to each of such elements a value
from 1 to $s$. If we fix the positions of the $t$ non-null elements, we obtain that the
number of such configurations is:

$$m(t, s) = \binom{t + (s - t) - 1}{s - t} = \binom{s - 1}{s - t}$$

As the positions for the $t$ non-null elements are not fixed, we have to multiply
$m(t, s)$ by the number of all possible dispositions of $t$ non-nulls over $t^u$ positions,
that is $\binom{t^u}{t}$.

Hence, $Q(t^u, t, s) = \binom{t^u}{t} \cdot m(t, s)$. □
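The counting argument of Theorem 1 can be checked numerically on small blocks. The following is a minimal Python sketch, not part of the original paper: the function names `q` and `joint_case1` and the example sizes are assumptions chosen for illustration. It builds the joint distribution of count and sum inside the range from the configuration counts and confirms that the probabilities add up to 1.

```python
from fractions import Fraction
from math import comb


def q(tu: int, t: int, s: int) -> int:
    """Number of vectors of size tu with exactly t non-null (>= 1) entries summing to s."""
    if (t == 0 and s > 0) or (t > 0 and s < t) or t > tu:
        return 0
    if t == 0 and s == 0:
        return 1
    # choose the non-null positions, then split s into t positive parts
    return comb(tu, t) * comb(s - 1, s - t)


def joint_case1(tu_in: int, tu_out: int, t: int, s: int) -> dict:
    """Map (t_in, s_in) -> probability of t_in non-nulls with sum s_in inside the range.

    tu_in / tu_out are the upper bounds on non-nulls inside / outside the range
    (hypothetical parameter names for this sketch).
    """
    total = q(tu_in + tu_out, t, s)
    dist = {}
    for t_in in range(0, min(t, tu_in) + 1):
        for s_in in range(0, s + 1):
            ways = q(tu_in, t_in, s_in) * q(tu_out, t - t_in, s - s_in)
            if ways:
                dist[(t_in, s_in)] = Fraction(ways, total)
    return dist


dist = joint_case1(tu_in=2, tu_out=3, t=3, s=5)
print(sum(dist.values()))                                # total probability
print(sum(p * t_in for (t_in, _), p in dist.items()))    # mean of the count
print(sum(p * s_in for (_, s_in), p in dist.items()))    # mean of the sum
```

Exact rational arithmetic (`Fraction`) is used so that the means can be compared with the closed forms without rounding issues.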



Mean and variance of the random variable $C_1([i..j])$ are presented in the next
proposition.

Proposition 1. Let $C_1([i..j])$ be the random variable defined in Theorem 1.
Then, mean and variance are, respectively:

$$E(C_1([i..j])) = \frac{t^u_{i..j}}{t^u} \cdot t$$

$$\sigma^2(C_1([i..j])) = \frac{t^u_{i..j} \cdot (t^u - t^u_{i..j}) \cdot t \cdot (t^u - t)}{(t^u)^2 \cdot (t^u - 1)}$$

Proof. (Sketch) Consider the vectors $V$, $V'$ and $V''$ defined in the proof of Theorem 1.
The event $(C_1([i..j]) = t_{i..j})$ is equivalent to the event that $V'$ contains
exactly $t_{i..j}$ non-null elements. Observe that the probability that an element is not
null is equal to $t / t^u$. Hence $(C_1([i..j]) = t_{i..j})$ is in turn equivalent to the event
of extracting $t_{i..j}$ non-nulls from $V$ in $t^u_{i..j}$ trials. This probability is described by
the well-known hypergeometric distribution [7]. □

Now we determine mean and variance of the random variable $S_1([i..j])$.

Theorem 2. Consider the random variable $S_1([i..j])$ defined in Theorem 1.
Then, mean and variance of $S_1([i..j])$ are, respectively:

$$E(S_1([i..j])) = \frac{t^u_{i..j}}{t^u} \cdot s$$

$$\sigma^2(S_1([i..j])) = \frac{t^u_{i..j} \cdot t^u_{\widetilde{i..j}} \cdot s}{(t^u)^2 \cdot (t^u - 1) \cdot (t + 1)} \cdot [t^u \cdot (2 \cdot s - t + 1) - s \cdot (t + 1)]$$

Proof. (Sketch) Consider the vectors $V$, $V'$ and $V''$ defined in the proof of Theorem 1.
The event $(S_1([i..j]) = s_{i..j})$ is equivalent to the following event: the sum of
all elements in $V'$ is $s_{i..j}$. From $s = \sum_{1 \le i \le t^u} V[i]$ we derive $s = \sum_{1 \le i \le t^u} E(V[i])$
by linearity of the operator $E$. Further, the mean of the random variable $V[i]$ is equal
to the mean of the random variable $V[j]$, for any $i, j$, $1 \le i, j \le t^u$. Indeed, by
symmetry, the probability that an element of $V$ assumes a given value is independent
of the position of this element inside the vector. Let us denote this
mean by $m$. From the above formula for $s$ it then follows that $m \cdot t^u = s$, thus $m = s / t^u$.
Consider now the vector $V'$. Let $S'$ be the random variable representing the sum
of all elements of $V'$. Then $E(S') = t^u_{i..j} \cdot m$. Hence, $E(S') = t^u_{i..j} \cdot s / t^u$.

The variance can be obtained using its definition. To this end, we first need
to determine the probability distribution of $S_1([i..j])$ from the joint probability
distribution obtained in Theorem 1. The detailed proof is rather elaborate and,
for the sake of presentation, is included in the appendix. □
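As a sanity check on the sum query in Case 1, the moments can be compared against a brute-force enumeration of all admissible vectors. The sketch below is illustrative only: the function names are hypothetical, and the closed-form variance it uses was re-derived for this example from the symmetry argument above (the bracketed factor matches the theorem statement), so it should be checked against the appendix of the paper before being relied upon.

```python
from fractions import Fraction
from itertools import product


def brute_force_moments(tu: int, tu_in: int, t: int, s: int):
    """Mean and variance of the sum of the first tu_in entries, by enumeration.

    Enumerates every vector of size tu with exactly t non-null entries and total sum s.
    """
    sums = []
    for v in product(range(s + 1), repeat=tu):
        if sum(v) == s and sum(1 for x in v if x > 0) == t:
            sums.append(sum(v[:tu_in]))
    n = len(sums)
    mean = Fraction(sum(sums), n)
    var = Fraction(sum(x * x for x in sums), n) - mean ** 2
    return mean, var


def closed_form_moments(tu: int, tu_in: int, t: int, s: int):
    """Closed-form mean and (re-derived, assumed) variance; requires tu >= 2."""
    mean = Fraction(tu_in * s, tu)
    var = Fraction(tu_in * (tu - tu_in) * s, tu * tu * (tu - 1) * (t + 1)) * (
        tu * (2 * s - t + 1) - s * (t + 1)
    )
    return mean, var


print(brute_force_moments(4, 2, 2, 5))
print(closed_form_moments(4, 2, 2, 5))
```

The enumeration grows as $(s+1)^{t^u}$, so it is only practical for very small blocks; its purpose is validation, not estimation.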

Note that the mean of the random variable $S_1([i..j])$ representing the sum
query does not depend on the number $t$ of non-null elements in the block $B_k$.
On the other hand, the knowledge about certain null elements derived by the
function $LB_{=0}$ does influence the value of the sum. Indeed, the mean depends
both on the size of the query range w.r.t. the size of the block and on the number
of the nulls that are already known to be in the range and in the complementary
part of the block.



5 Case 2: Range Query Estimation Using Both Lower 
Bounds and Upper Bounds on the Number of Non-null 
Elements 



We are now ready to perform the estimation of range queries in the general case
where the datacube population is the set $\sigma_{LB_{=0},LB_{>0}}(M^{-1}_{cs,F})$ of all datacubes
having the same aggregate data (count and sum) as $M$ and satisfying both
constraints: the lower bound on the number of null elements occurring in each
range and the lower bound on the number of non-null elements.



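Theorem 3. Let $C_2([i..j]) = count(M[i..j])$ and $S_2([i..j]) = sum(M[i..j])$ be two
integer random variables ranging from 0 to $t$ and from 0 to $s$, respectively, defined
by taking $M$ in the datacube population $\sigma_{LB_{=0},LB_{>0}}(M^{-1}_{cs,F})$. Then, for each $t_{i..j}$
and $s_{i..j}$, $t^l_{i..j} \le t_{i..j} \le t^u_{i..j}$ and $0 \le s_{i..j} \le s$, the joint probability distribution
$P(C_2([i..j]) = t_{i..j}, S_2([i..j]) = s_{i..j})$ is equal to:

$$P(C_2([i..j]) = t_{i..j}, S_2([i..j]) = s_{i..j}) = \frac{N(t^u_{i..j}, t_{i..j}, s_{i..j}, t^l_{i..j}) \cdot N(t^u_{\widetilde{i..j}}, t_{\widetilde{i..j}}, s_{\widetilde{i..j}}, t^l_{\widetilde{i..j}})}{N(t^u, t, s, t^l)}$$

where $t_{\widetilde{i..j}} = t - t_{i..j}$, $s_{\widetilde{i..j}} = s - s_{i..j}$ and

$$N(t^u, t, s, t^l) = \begin{cases} 0 & \text{if } t > t^u \vee t > s \vee (t = 0 \wedge s > 0), \\ 1 & \text{if } t = 0 \wedge s = 0, \\ \binom{t^u - t^l}{t - t^l} \cdot \binom{s-1}{s-t} & \text{otherwise.} \end{cases}$$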



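Proposition 2. Consider the random variable $C_2([i..j])$ of Theorem 3. Then,
mean and variance are:

$$E(C_2([i..j])) = t^l_{i..j} + \frac{t^u_{i..j} - t^l_{i..j}}{t^u - t^l} \cdot (t - t^l)$$

$$\sigma^2(C_2([i..j])) = \frac{(t^u_{i..j} - t^l_{i..j}) \cdot [(t^u - t^l) - (t^u_{i..j} - t^l_{i..j})]}{(t^u - t^l)^2 \cdot (t^u - t^l - 1)} \cdot (t - t^l) \cdot (t^u - t)$$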



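Theorem 4. Consider the random variable $S_2([i..j])$ of Theorem 3. Then, mean
and variance are:

$$E(S_2([i..j])) = t^l_{i..j} \cdot \frac{s}{t} + (t^u_{i..j} - t^l_{i..j}) \cdot \frac{s}{t} \cdot \frac{t - t^l}{t^u - t^l}$$

$$\sigma^2(S_2([i..j])) = \alpha \cdot (t^u_{i..j} - t^l_{i..j}) \cdot \frac{t - t^l}{t^u - t^l} \cdot \left[1 + (t^u_{i..j} - t^l_{i..j} - 1) \cdot \frac{t - t^l - 1}{t^u - t^l - 1}\right] + (\beta + 2 \cdot \alpha \cdot t^l_{i..j}) \cdot (t^u_{i..j} - t^l_{i..j}) \cdot \frac{t - t^l}{t^u - t^l} + (\alpha \cdot (t^l_{i..j})^2 + \beta \cdot t^l_{i..j}) - \gamma$$

where:

$$\alpha = \frac{s \cdot (s + 1)}{t \cdot (t + 1)}, \quad \beta = \frac{s \cdot (s - t)}{t \cdot (t + 1)} \quad \text{and} \quad \gamma = \left[t^l_{i..j} \cdot \frac{s}{t} + (t^u_{i..j} - t^l_{i..j}) \cdot \frac{s}{t} \cdot \frac{t - t^l}{t^u - t^l}\right]^2.$$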


Note that, unlike Case 1, the mean of the random variable $S_2([i..j])$ representing
the sum query depends on the number $t$ of non-null elements occurring
in the block $B_k$. Indeed, in this case, the information encoded in the function
$LB_{>0}$ actually invalidates the symmetry condition about the aggregate information
used for the estimation, on which the independence of the mean from $t$ is
based. This happens since $LB_{>0}$ returns a number of certain non-null elements,
thus giving a positive contribution both to the count and the sum query. Thus,
such positions cannot be eliminated in order to re-formulate the query into a
new query applied on a block with indistinguishable positions as it happens for
Case 1.

Also in this case, the mean is not in general a linear function with respect to the
size of the query, since it depends both on $t^u_{i..j}$ (and then on $LB_{=0}([i..j])$) and on
$t^l_{i..j}$ (and then on $LB_{>0}([i..j])$). Thus, once again, the estimation depends on the
actual distribution of the values inside the block and not only on the aggregate
information concerning count and sum of the block.
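To make the dependence of the Case 2 means on the bounds concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function name `case2_means` and its parameter names are hypothetical, and the sum mean is taken as $(s/t)$ times the count mean, which is how the formulas above read once each non-null element is given the average value $s/t$.

```python
from fractions import Fraction


def case2_means(tu_in: int, tl_in: int, tu: int, tl: int, t: int, s: int):
    """Return (E(C_2), E(S_2)) for a range inside a block; requires t >= 1.

    tl_in / tu_in: lower / upper bound on non-nulls inside the range,
    tl / tu:       lower / upper bound on non-nulls in the whole block.
    """
    free_positions = tu - tl   # positions whose null/non-null status is unknown
    free_non_nulls = t - tl    # non-nulls that still have to be placed on them
    if free_positions == 0:
        count_mean = Fraction(tl_in)
    else:
        count_mean = tl_in + Fraction((tu_in - tl_in) * free_non_nulls, free_positions)
    sum_mean = Fraction(s, t) * count_mean
    return count_mean, sum_mean


# Block of 10 positions, 3 non-nulls, total 12; the range covers 4 positions.
print(case2_means(tu_in=4, tl_in=0, tu=10, tl=0, t=3, s=12))  # no certain non-nulls
print(case2_means(tu_in=4, tl_in=1, tu=10, tl=1, t=3, s=12))  # one certain non-null in range
```

With no certain non-nulls the sum mean reduces to the Case 1 value $t^u_{i..j} \cdot s / t^u$; once one position in the range is known to be non-null, the estimate moves away from that linear value.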
6 Estimation of Range Queries on Histograms

Histograms are mono-dimensional compressed datacubes that are used to summarize
the frequency distribution of an attribute of a database relation for the
estimation of query result sizes [14,15,16]. The estimation is made using aggregate
data such as the number $t$ of non-null values in each block (bucket in the
histogram terminology) $B_k$, the total frequency sum $s$ in $B_k$ and the boundaries
of $B_k$. A crucial point for providing good estimations is the way the frequency
distributions for original values are partitioned into buckets. Here we assume 
that the buckets have been already arranged using any of the known techniques 
and we therefore focus on the problem of estimating the frequency distribution 
inside a bucket. 

The most common approach is based on the continuous value assumption [17] : 
the sum of frequencies in a range of a bucket is estimated by linear interpolation. 
It thus corresponds to equally distributing the overall sum of frequencies of the 
bucket to all attribute values occurring in it. This result can be derived from 
Theorem 2 by assuming that there are no integrity constraints on the number 
of null and non-null elements.

Corollary 1. Let $B_k$ be a block of a histogram and let $S_3([i..j]) = sum(M[i..j])$
be an integer random variable ranging from 0 to $s$, defined by taking $M$ in the
datacube population $M^{-1}_{cs,F}$. Then mean and variance of $S_3([i..j])$ are, respectively:

$$E(S_3([i..j])) = \frac{b_{i..j}}{b} \cdot s$$

$$\sigma^2(S_3([i..j])) = \frac{b_{i..j} \cdot (b - b_{i..j}) \cdot s}{b^2 \cdot (b - 1) \cdot (t + 1)} \cdot [b \cdot (2 \cdot s - t + 1) - s \cdot (t + 1)]$$

Thus our approach gives a model to explain the linear interpolation and,
besides, allows one to evaluate the error of the estimation, thus exploiting the
knowledge about the number $t$ of non-nulls in a block, whereas $t$ is not mentioned
in the computation of the mean.
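The practical use of the corollary is to attach an error estimate to the usual linear interpolation inside a bucket. The sketch below is a hedged illustration: `bucket_sum_estimate` and its parameters are hypothetical names, and the variance expression is the Case 1 form specialised to $t^u = b$, whose leading factor was re-derived for this example rather than read from the original.

```python
from fractions import Fraction
from math import sqrt


def bucket_sum_estimate(b: int, b_range: int, t: int, s: int):
    """Mean and variance of the frequency sum over b_range of the b values of a bucket.

    Assumes no integrity constraints on nulls and non-nulls, and b >= 2.
    """
    mean = Fraction(b_range * s, b)  # linear interpolation
    var = Fraction(b_range * (b - b_range) * s, b * b * (b - 1) * (t + 1)) * (
        b * (2 * s - t + 1) - s * (t + 1)
    )
    return mean, var


mean, var = bucket_sum_estimate(b=10, b_range=4, t=3, s=12)
print(float(mean), sqrt(var))  # estimate and its standard deviation
```

Reporting the standard deviation next to the interpolated value shows how uncertain a sparse bucket (small $t$ relative to $b$) can be even when the mean looks reasonable.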

We now recall that the classical definition of histogram requires that both
lowest and highest elements (or at least one of them) of any block are not null
(i.e., they are attribute values occurring in the relation). A block for which both
extreme elements are not null is called 2-biased; if only the lowest (or the
highest) element is not null then the block is called 1-biased.

So far, linear interpolation has also been used for biased blocks, thus producing a
wrong (one might say “biased”) estimation. We next show
the correct formulas that are derived from Theorem 4.

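Corollary 2. Let $B_k$ be a block of a histogram and let $S_4([i..j]) = sum(M[i..j])$
be an integer random variable ranging from 0 to $s$, defined by taking $M$ in the
datacube population $\sigma_{LB_{>0}}(M^{-1}_{cs,F})$. Then

1. if the block $B_k$ is 1-biased and $i$ is the lowest element of the block then mean
and variance of $S_4([i..j])$ are, respectively:

$$E(S_4([i..j])) = \frac{s}{t} \cdot \left[1 + (b_{i..j} - 1) \cdot \frac{t - 1}{b - 1}\right]$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot (b_{i..j} - 1) \cdot \frac{t - 1}{b - 1} \cdot \left[1 + (b_{i..j} - 2) \cdot \frac{t - 2}{b - 2}\right] + (\beta + 2 \cdot \alpha) \cdot (b_{i..j} - 1) \cdot \frac{t - 1}{b - 1} + (\alpha + \beta) - E(S_4([i..j]))^2$$

2. if the block $B_k$ is 1-biased and $i$ is not the lowest element of the block then
mean and variance of $S_4([i..j])$ are, respectively:

$$E(S_4([i..j])) = \frac{s}{t} \cdot b_{i..j} \cdot \frac{t - 1}{b - 1}$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot b_{i..j} \cdot \frac{t - 1}{b - 1} \cdot \left[1 + (b_{i..j} - 1) \cdot \frac{t - 2}{b - 2}\right] + \beta \cdot b_{i..j} \cdot \frac{t - 1}{b - 1} - E(S_4([i..j]))^2$$

3. if the block $B_k$ is 2-biased and either $i$ or $j$ is an extreme element of the block
then mean and variance of $S_4([i..j])$ are, respectively:

$$E(S_4([i..j])) = \frac{s}{t} \cdot \left[1 + (b_{i..j} - 1) \cdot \frac{t - 2}{b - 2}\right]$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot (b_{i..j} - 1) \cdot \frac{t - 2}{b - 2} \cdot \left[1 + (b_{i..j} - 2) \cdot \frac{t - 3}{b - 3}\right] + (\beta + 2 \cdot \alpha) \cdot (b_{i..j} - 1) \cdot \frac{t - 2}{b - 2} + (\alpha + \beta) - E(S_4([i..j]))^2$$

4. if the block $B_k$ is 2-biased and neither $i$ nor $j$ is an extreme element of the
block then mean and variance of $S_4([i..j])$ are, respectively:

$$E(S_4([i..j])) = \frac{s}{t} \cdot b_{i..j} \cdot \frac{t - 2}{b - 2}$$

$$\sigma^2(S_4([i..j])) = \alpha \cdot b_{i..j} \cdot \frac{t - 2}{b - 2} \cdot \left[1 + (b_{i..j} - 1) \cdot \frac{t - 3}{b - 3}\right] + \beta \cdot b_{i..j} \cdot \frac{t - 2}{b - 2} - E(S_4([i..j]))^2$$

where:

$$\alpha = \frac{s \cdot (s + 1)}{t \cdot (t + 1)} \quad \text{and} \quad \beta = \frac{s \cdot (s - t)}{t \cdot (t + 1)}$$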



The above formulas have been used in [4] to replace the continuous value 
assumption inside one of the most efficient methods for histogram representation 
(the maxdiff method [16]) and have produced some meaningful improvements in 
the performance of the method. 

In [16,15], another method for estimating frequency sum inside a block is 
proposed, based on the uniform spread assumption: the t non-null attribute 
values in each bucket are assumed to be located at equal distance from each other 
and the overall frequency sum is therefore equally distributed among them. This 
method does not give a correct estimation unless we assume that non-nulls are
scattered on the block in some particular, unrealistic way. Our approach gives 
instead an unbiased estimation. 
Abstract. Constrained clustering, i.e., finding clusters that satisfy user-specified
constraints, is highly desirable in many applications. In this
paper, we introduce the constrained clustering problem and show that
traditional clustering algorithms (e.g., k-means) cannot handle it. A
scalable constraint-clustering algorithm is developed in this study which 
starts by finding an initial solution that satisfies user-specified constraints 
and then refines the solution by performing confined object movements 
under constraints. Our algorithm consists of two phases: pivot movement 
and deadlock resolution. For both phases, we show that finding the op- 
timal solution is NP-hard. We then propose several heuristics and show 
how our algorithm can scale up for large data sets using the heuristic 
of micro-cluster sharing. By experiments, we show the effectiveness and 
efficiency of the heuristics. 



1 Introduction 

Cluster analysis has been an active area of research in computational stati- 
stics and data mining with many algorithms developed. However, few algo- 
rithms incorporate user-specific constraints in cluster analysis. Many studies 
show that constraint-based mining is highly desirable since it often leads to ef- 
fective and fruitful data mining by capturing application semantics [NLHP98, 
KPR98,LNHP99]. This is also the case in cluster analysis. 

Formally, the unconstrained clustering problem can be defined as follows.
Unconstrained Clustering (UC): Given a data set $D$ with $n$ objects, a distance
function $df : D \times D \to \mathbb{R}$, and a positive integer $k$, find a $k$-clustering, i.e., a partition
of $D$ into $k$ disjoint clusters $(Cl_1, \ldots, Cl_k)$ such that $DISP = \sum_{i=1}^{k} disp(Cl_i, rep_i)$
is minimized.

The “dispersion” of cluster $Cl_i$, $disp(Cl_i, rep_i)$, measures the total distance
between each object in $Cl_i$ and the representative $rep_i$ of $Cl_i$, i.e., $disp(Cl_i, rep_i)$
is defined as $\sum_{p \in Cl_i} df(p, rep_i)$. The representative of a cluster $Cl_i$ is chosen such
that $disp(Cl_i, rep_i)$ is minimized. Finding such a representative for each cluster
is generally not difficult. For example, the k-means algorithm uses the centroid
of the cluster as its representative, which can be calculated in linear time.
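The objective can be written down in a few lines. The following Python sketch is only an illustration: it assumes points are coordinate tuples and takes $df$ to be the squared Euclidean distance (an assumption made here because the centroid is the minimizer of dispersion for that choice); the helper names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance, used here as the distance function df."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster, computed in linear time."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def disp(cluster, rep, df=sq_dist):
    """Dispersion of one cluster: total distance of its objects to the representative."""
    return sum(df(p, rep) for p in cluster)


def total_disp(clusters, df=sq_dist):
    """DISP of a clustering, using each cluster's centroid as its representative."""
    return sum(disp(cl, centroid(cl), df) for cl in clusters)


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 4.0)]]
print(total_disp(clusters))
```

Passing a different `df` (and a matching way of choosing representatives) would cover other variants such as k-medoids.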

The constrained clustering problem can be defined as follows. 

Constrained Clustering (CC): Given a data set $D$ with $n$ objects, a distance function
$df : D \times D \to \mathbb{R}$, a positive integer $k$, and a set of constraints $C$, find a $k$-clustering
$(Cl_1, \ldots, Cl_k)$ such that $DISP = \sum_{i=1}^{k} disp(Cl_i, rep_i)$ is minimized, and each cluster
$Cl_i$ satisfies the constraints $C$, denoted as $Cl_i \models C$.

A fundamental difference between the UC and CC problems is that the unconstrained
clustering algorithms are designed to find clusterings satisfying the
nearest rep(resentative) property (NRP), defined below, whereas for the CC problem,
the NRP may conflict with constraint satisfaction.

The Nearest Rep(resentative) Property (NRP): Let $(Cl_1, \ldots, Cl_k)$ be the $k$-clustering
computed by the algorithm, and let $rep_i$ denote the representative of cluster $Cl_i$,
$1 \le i \le k$. Then a data object $p \in D$ is placed in a cluster $Cl_j$ iff $rep_j$ is the closest
to $p$ among all the representatives, i.e., $(\forall p \in D)(\forall\, 1 \le j \le k)\ [p \in Cl_j \Leftrightarrow (\forall i \ne j)\ df(p, rep_j) \le df(p, rep_i)]$.

In this paper, we study the CC problem. A taxonomy of constraints useful 
in applications is presented in Section 2. In Section 3, we review works related 
to the CC problem, and in Section 4, we analyze the major challenges of CC. 
In Section 5, we develop an algorithm for CC under an existential constraint. In 
Section 6, we study how to scale up our algorithm by micro-cluster sharing. The 
experiments evaluating the effectiveness of the proposed heuristics are reported 
in Section 7. Section 8 discusses the handling of other SQL aggregate constraints, 
and Section 9 concludes the paper. 

2 A Taxonomy of Constraints for Clustering 

Depending on the nature of the constraints and applications, the CC problem 
can be classified into the following categories. 

1. Constraint on individual objects: This constraint confines the set of objects 
to be clustered, e.g., cluster only luxury mansions of value over one million 
dollars. It can be easily handled by preprocessing (e.g., performing selection 
using an SQL query), after which the problem reduces to an instance of the 
UC problem. 

2. Obstacle objects as constraints: A city may have rivers, bridges, highways, 
lakes, mountains, etc. Such obstacles and their effects can be captured by 
redefining the distance functions df{) among objects. Once that is done, the 
problem again reduces to an instance of the UC problem. 

3. Clustering parameters as "constraints": Some “constraints” may serve as the 
parameters in a clustering algorithm, e.g., the number of clusters, k. Such 
parameters, though specifiable by users, are not considered as constraints in 
our study. 

4. Constraints imposed on each individual cluster: This is the theme of our study. 
Within this class, we focus on constraints formulated with SQL aggregates. 

Let each object $O_i$ in the database $D$ be associated with a set of $m$ attributes
$\{A_1, \ldots, A_m\}$. The value of an attribute $A_j$ of an object $O_i$ is denoted as $O_i[A_j]$.

Definition 1 (SQL Aggregate Constraints). Consider the aggregate functions
$agg \in \{max(), min(), avg(), sum()\}$. Let $\theta$ be a comparator function, i.e.,
$\theta \in \{<, \le, \ne, =, \ge, >\}$, and $c$ represent a numeric constant. Given a cluster $Cl$,
an SQL aggregate constraint on $Cl$ is a constraint in one of the following forms:
(i) $agg(\{O_i[A_j] \mid O_i \in Cl\})\ \theta\ c$; or (ii) $count(Cl)\ \theta\ c$. □

In this paper, we mainly focus on one type of constraints, called existential
constraints:

Definition 2 (Existential Constraints). Let $W \subseteq D$ be any subset of objects.
We call them pivot objects. Let $c$ be a positive integer. An existential
constraint on a cluster $Cl$ is a constraint of the form: $count(\{O_i \mid O_i \in Cl, O_i \in W\}) \ge c$. □

Pivot objects are typically specified via constraints or other predicates. For ex- 
ample, in a market segmentation problem, pivot objects might be frequent custo- 
mers. See Section 8 for a discussion on the generality of existential constraints. 
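An existential constraint is straightforward to test on a candidate clustering. The sketch below is a minimal illustration, assuming objects are hashable identifiers and the pivot set $W$ is given as a Python set; the function names and the customer names in the example are hypothetical.

```python
def count_pivots(cluster, pivots):
    """Number of pivot objects (members of W) that belong to the cluster."""
    return sum(1 for obj in cluster if obj in pivots)


def satisfies_existential(cluster, pivots, c):
    """True iff the cluster contains at least c pivot objects."""
    return count_pivots(cluster, pivots) >= c


def is_valid_clustering(clusters, pivots, c):
    """True iff every cluster of the clustering satisfies the existential constraint."""
    return all(satisfies_existential(cl, pivots, c) for cl in clusters)


frequent_customers = {"ann", "bob", "eve"}  # pivot objects W
clusters = [["ann", "bob", "carl"], ["dan", "eve"]]
print(is_valid_clustering(clusters, frequent_customers, c=1))
print(is_valid_clustering(clusters, frequent_customers, c=2))
```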



3 Related Work 

Cluster analysis has been an active area in computational statistics and data 
mining. Clustering methods can be categorized into partitioning methods [KR90, 
NH94], hierarchical methods [KR90,ZRL96], density-based methods [EKSX96], 
grid-based methods [WYM97,AGGR98], and model-based methods [HaKa00].
However, none of the existing methods incorporates user-specified constraints. 

A problem somewhat similar to the CC problem is the facility location pro- 
blem [STA97], mostly studied in operational research and theoretical computer 
science. It tries to locate k facilities to serve n customers such that the traveling 
distance from the customers to their facility is minimized. However, the only 
type of constraints they studied are constraints on the capacity of the facility, 
i.e., each facility can only serve a limited number of customers. If we assume 
that customers cannot be “split” between two facilities (as we do for CC), the
resultant solution will require an increase both in the number of facilities and 
in the capacity of these facilities. However, if the customers can be “split” , only 
the number of facilities needs to be increased. Such an increase in number of 
facilities and capacity is inappropriate for the CC problem as we treat user’s
constraints as hard constraints. 

Since CC is a kind of constrained optimization problem, mathematical programming
naturally comes to mind. Our concern, however, is its scalability with
respect to a large database. To cluster n customers into k clusters, a mathema- 
tical programming approach will involve at least k x n variables. As n can be as 
large as a few millions, it is very expensive to perform mathematical program- 
ming. As can be seen later, our solution for handling a very large dataset involves 
a novel concept called micro-cluster sharing. This may correspond to dynamic 
combining and splitting of equations in a mathematical program, which has not 
been considered in mathematical programming but could be an interesting future 
direction for performing mathematical programming in large databases. 
4 The Nearest Representative Property (NRP)

We first consider the theoretical implication of adding constraints to clustering,
by examining the popular k-means algorithm, although the discussion generalizes
to other algorithms, such as the k-medoids algorithm.

Given a set of constraints $C$, a “solution” space for the CC problem is defined
as

$$ClSp(C, k, D) = \{(Cl_1, \ldots, Cl_k) \mid \forall\, 1 \le i, j \le k : \emptyset \subset Cl_j \subset D \ \&\ Cl_j \models C \ \&\ \textstyle\bigcup_j Cl_j = D \ \&\ Cl_i \cap Cl_j = \emptyset, \text{ for } i \ne j\}$$

We refer to $ClSp(C, k, D)$ as the (constrained) clustering space. Clusterings found
by the k-means algorithm satisfy the NRP. Accordingly, the constrained mean
solution space is defined as:

$$MeanSp(C, k, D) = \{(Cl_1, \ldots, Cl_k) \mid (Cl_1, \ldots, Cl_k) \in ClSp(C, k, D) \ \&\ \forall\, 1 \le j \le k, \forall q \in D : (q \in Cl_j \Leftrightarrow (\forall i \ne j : df(q, p_j) \le df(q, p_i)))\}$$

where $p_j$ is the centroid of cluster $Cl_j$. It should be clear by definition that
the mean space $MeanSp()$ is a strict subset of the clustering space $ClSp()$.
To understand the role played by the NRP, let us revisit the situation when
the set of constraints $C$ is empty. The k-means algorithm does the smart thing
by operating in the smaller $MeanSp()$ space rather than in the $ClSp()$ space. More
importantly, the following theorem says that there is no loss of quality. Unless
stated otherwise, proofs of the results in this paper can be found in [TNLH00]
but are omitted here for lack of space.

Theorem 1. A clustering UCC is an optimal solution to the UC problem in the
space $ClSp(\emptyset, k, D)$ iff it is an optimal solution to the UC problem in the mean
space $MeanSp(\emptyset, k, D)$. □

Like virtually all existing clustering algorithms, the k-means algorithm does
not attempt to find the global optimum. This is because the decision problem
corresponding to k-clustering is NP-complete even for $k = 2$ [GJ79]. Thus, the
k-means algorithm focuses on finding local optima. Theorem 1 can be generalized
from the global optimum to a local optimum.

The point here is that $MeanSp(\emptyset, k, D)$ contains the “cream” of
$ClSp(\emptyset, k, D)$, in that the global and local optima in $ClSp(\emptyset, k, D)$ are also contained
in the smaller $MeanSp(\emptyset, k, D)$. This nice situation, however, does not generalize
to the CC problem. For example, suppose there are only four customers
with three located close to each other at one end of a highway, and the fourth
at the other end. If the CC problem is to find two clusters with (at least) two
customers in each, it is easy to see that it is impossible to satisfy the constraint
and the NRP simultaneously.
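The four-customer example can be verified by brute force. The Python sketch below is illustrative only: the highway is modelled as a line, the customer positions (0, 1, 2 and 100) are made-up values, and centroids are used as representatives. It enumerates every split into two clusters of two customers and tests the NRP on each.

```python
from itertools import combinations


def centroid(cluster):
    """Mean position of the customers in a one-dimensional cluster."""
    return sum(cluster) / len(cluster)


def satisfies_nrp(clusters):
    """True iff every object is at least as close to its own centroid as to any other."""
    reps = [centroid(c) for c in clusters]
    for j, cluster in enumerate(clusters):
        for p in cluster:
            own = abs(p - reps[j])
            if any(abs(p - reps[i]) < own for i in range(len(reps)) if i != j):
                return False
    return True


customers = [0.0, 1.0, 2.0, 100.0]  # three close together, one far away

# All clusterings with exactly two customers per cluster (each split listed once).
valid_clusterings = []
for first in combinations(customers, 2):
    second = tuple(x for x in customers if x not in first)
    if first < second:
        valid_clusterings.append((first, second))

for clustering in valid_clusterings:
    print(clustering, satisfies_nrp(clustering))

# The unconstrained grouping, by contrast, does satisfy the NRP.
print(satisfies_nrp(((0.0, 1.0, 2.0), (100.0,))))
```

In every constrained split, one of the three nearby customers ends up in the cluster of the distant customer although it is closer to the other centroid.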

To resolve this conflict, we adopt the policy that the user-defined constraints 
take precedence over the NRP. Specifically, the algorithm to be presented next 
regards the set C to be hard constraints that must be satisfied. The NRP, on the 
other hand, is treated as a “soft” constraint in the sense that it is satisfied as 
much as possible by the minimization of $\sum_{i=1}^{k} disp(Cl_i, rep_i)$. But there is no
guarantee that every object is in the cluster corresponding to its nearest center. 
5 Clustering without the Nearest Representative
Property

In this section, we will develop an algorithm to perform CC under an existential
constraint. An important difference of our method from the UC algorithms is
that our algorithm tries to find a good solution by performing cluster refinement
in the constraint space, $ClSp(C, k, D)$, which we represent using a clustering
locality graph, $G = (V, E)$, described as follows:

- The set $V$ of nodes is the set of all $k$-clusterings. More precisely, it is the
unconstrained clustering space $ClSp(\emptyset, k, D)$. Nodes which satisfy the existential
constraint (EC) are called valid nodes, and those that do not are called
invalid nodes.

- There is an edge $e$ between two nodes $CL_1$, $CL_2$ in the graph iff they differ
by only one pivot object, i.e., $CL_1$ is of the form $(Cl_1, \ldots, Cl_i, \ldots, Cl_j, \ldots, Cl_k)$,
whereas $CL_2$ is of the form $(Cl_1, \ldots, Cl_i - \{p\}, \ldots, Cl_j \cup \{p\}, \ldots, Cl_k)$
for some pivot object $p \in Cl_i$ and $j \ne i$. If a node $CL_2$ is connected to $CL_1$ by
an edge, then $CL_2$ is called a neighbor of $CL_1$ and vice versa.

With such a graph, a naive algorithm to solve the CC problem given $k$ and EC
is to first pick a valid node in the locality graph and move to a valid neighboring
node which gives the highest decrease in DISP. Intuitively, such a node movement
is a cluster refinement process similar to the k-means algorithm which
tries to refine the clustering by moving objects to the nearest center to reduce
DISP. The cluster refinement process terminates when no node of lower DISP
is found. The algorithm will then output $CL$ as the solution. However, this is a
generate-and-test algorithm which is inefficient since the number of neighbors of
a node is potentially large. To improve its efficiency, the number of nodes to be
examined needs to be restricted.



5.1 Cluster Refinement under Constraints 

To derive a more efficient algorithm for CC, we first define a set of unstable
pivots given a valid node $CL = (Cl_1, \ldots, Cl_k)$.

Definition 3. (Unstable Pivots) A set of unstable pivots, $S$, with respect to $CL$
is a collection of all pivots in $D$ such that each $s \in S$ belongs to some $Cl_i$ in $CL$
but $s$ is nearer to a representative of some $Cl_j$, $j \ne i$. □

Using $S$, we form a subgraph of $G$, viz., $SG = (SV, SE)$, where the set of
nodes $SV$ is defined as follows: (1) (base case) the initial node $CL$, representing
the chosen valid clustering, is in $SV$; (2) (inductive case) for any node $CL$ in
$SV$, if (i) there is an object $s$ in $Cl_i$ whose nearest cluster representative is in
$Cl_j$, and (ii) $CL$ is of the form $(Cl_1, \ldots, Cl_i, \ldots, Cl_j, \ldots, Cl_k)$, then the node $CL'$ of the
form $(Cl_1, \ldots, Cl_i - \{s\}, \ldots, Cl_j \cup \{s\}, \ldots, Cl_k)$ is also in $SV$; and (3) there is
no other node in $SV$. Intuitively, once $S$ is defined, the subgraph $SG$ includes all
the nodes that are reachable from $CL$ via the movements of some $s \in S$ to their
nearest cluster. Let us denote the DISP of any node $v$ with respect to a set of
representatives $REP$ as $DISP_{REP}(v)$.
Theorem 2. $DISP_{REP'}(CL') \le DISP_{REP}(CL)$ for any nodes $CL$, $CL'$ in $SG$
as above, $REP$ and $REP'$ being the set of representatives for $CL$ and $CL'$ respectively.

Proof: Let $REP = (rep_1, \ldots, rep_k)$ and $REP' = (rep'_1, \ldots, rep'_k)$. The dispersion
of $CL'$ calculated with respect to $REP$ will be $DISP_{REP}(CL') = \sum_{i=1}^{k} disp(Cl'_i, rep_i)$.

We first observe that

$$DISP_{REP}(CL') \le DISP_{REP}(CL)$$

This is because the set of representatives is the same on both sides of the inequality
and since $CL'$ can be obtained by moving some $s \in S$ to their nearest
representative in $REP$, the reduction in dispersion will result in the above observation.
On the other hand, since $REP'$ is a set of representatives for $CL'$,
by definition they will minimize the dispersion for $CL'$; we thus have the
following inequality,

$$DISP_{REP'}(CL') \le DISP_{REP}(CL')$$

By combining these two inequalities together, we have

$$DISP_{REP'}(CL') \le DISP_{REP}(CL') \le DISP_{REP}(CL)$$ □

By Theorem 2, we conclude that our clusters can in fact be refined just by
searching $SG$. There are two advantages to doing this. First, our efficiency improves
because the number of nodes to be searched is reduced, and the movement
always leads to progressive refinement in clustering quality. This in itself does
not guarantee the chosen neighbor is valid. Second, instead of considering only
neighbors, $SG$ allows us to consider nodes that are many steps away.

Given $SG$, we adopt the steepest descent approach and plan a path along
the valid nodes of $SG$ which leads to a new valid node $CL'$ with minimized
dispersion in $SG$. We call this problem the Best Path (BP) Problem. To plan the
path, only unstable pivots in a surplus cluster (a cluster which has more objects
than required by EC) can be moved. We call such an object a movable object.
To gain more insight into the BP problem and to derive an algorithm for solving
it, we introduce a concept called pivot movement graph which can be used to
represent the state of clustering in each node of $SG$.

Definition 4. (Pivot Movement Graph) A pivot movement graph is a directed
graph in which each cluster is represented by a node. An edge from $Cl_i$ to $Cl_j$
indicates that there is at least one unstable pivot object in $Cl_i$ that has $Cl_j$
as its nearest center. These objects are represented as labels on the edge. The
reduction in DISP when an unstable object is moved to its nearest center is
shown next to each of these objects. □

Figure 1 shows an example of a pivot movement graph which is under the
constraint “$\forall i$, $count(Cl_i) \ge 50$”. As such, the surplus clusters at this instance
are $Cl_1$, $Cl_3$ and $Cl_5$. Figure 2 shows the actual situation depicted by the pivot
movement graph in Figure 1. For clarity, only the unstable pivots and the cluster
representatives (marked by a “x”) are shown. Given a pivot movement graph,
a Pivot Movement (PM) problem is the problem of computing a schedule of
movements for the unstable objects in the graph such that the total reduction
in DISP is maximized.
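A pivot movement graph can be assembled directly from a clustering and its current representatives. The following Python sketch is a simplified illustration, not the authors' implementation: the function name, the dictionary representation of edges, and the one-dimensional example are assumptions, and the reduction in DISP is computed with the representatives held fixed, as in the first step of the proof of Theorem 2.

```python
from collections import defaultdict


def pivot_movement_graph(clusters, reps, pivots, df, min_pivots):
    """Build edges {(i, j): [(object, reduction_in_DISP), ...]} and the surplus clusters.

    An edge (i, j) lists the unstable pivots of cluster i whose nearest representative
    is reps[j]. A cluster is surplus if it holds more pivots than min_pivots.
    """
    edges = defaultdict(list)
    for i, cluster in enumerate(clusters):
        for obj in cluster:
            if obj not in pivots:
                continue
            j = min(range(len(reps)), key=lambda m: df(obj, reps[m]))
            gain = df(obj, reps[i]) - df(obj, reps[j])
            if j != i and gain > 0:
                edges[(i, j)].append((obj, gain))
    surplus = [
        i for i, cluster in enumerate(clusters)
        if sum(1 for o in cluster if o in pivots) > min_pivots
    ]
    return dict(edges), surplus


clusters = [[0, 1, 7], [9, 10, 11]]
reps = [2.0, 10.0]            # representatives assumed given for this example
pivots = {0, 1, 7, 9, 10, 11}
edges, surplus = pivot_movement_graph(
    clusters, reps, pivots, df=lambda p, r: abs(p - r), min_pivots=2
)
print(edges)    # unstable pivots per edge with their DISP reduction
print(surplus)  # clusters from which a pivot may be moved
```

Only pivots on edges that leave a surplus cluster are movable; a scheduling heuristic would pick among those edges and rebuild the graph after each move.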




Constraint-Based Clustering in Large Databases 411 




Fig. 1. A Pivot Movement Graph          Fig. 2. The Actual Situation



Theorem 3. The BP problem is equivalent to the PM problem. 

Proof: Given an optimized solution for BP, we follow the path given in the 
solution and move the pivots in the corresponding pivot movement graph. This 
will give a maximized reduction in dispersion. Similarly, if an optimized schedule 
is given for PM, we can follow the schedule and move along a path where each 
node in the path corresponds to a state of the pivot movement graph when the 
schedule is followed. This will bring us to a node with minimized dispersion in
$SG$. □

Given their equivalence, it suffices to focus on the PM problem. 

Definition 5. (The PM Decision Problem) Given a pivot movement graph and
an existential constraint EC, the PM decision problem is to determine whether
there is a schedule of movements of objects around the clusters such that EC is
satisfied at all times and the total dispersion being reduced is $\ge B$, where $B$ is a
numeric constant. □

Two observations hint at the difficulty of this problem. (1) The movement 
of an unstable pivot object could possibly trigger a series of movements of other 
unstable pivot objects. For example, by moving O3 from Cl1 to Cl2, Cl2 now has 
51 pivot objects, and thus we could move O9 from Cl2 to Cl3. We refer to such 
a series of triggerings as a movement path. (2) Given a surplus cluster with 
more than one outgoing edge, the choice of the outgoing edge that minimizes 
DISP in the resultant movement path is not obvious. Indeed we can show: 

Theorem 4. The PM decision problem is NP-complete. 

Proof: See [TNLH00]. □ 

Furthermore, by using a result given in [KMR97], we can show that it is not 
possible to compute in polynomial time a constant factor approximation for the 
PM problem. Thus, an alternative is to use heuristics that work well in 
practice and are efficient enough to handle a large dataset. The purpose of the 
heuristic is to iteratively pick an edge in the pivot movement graph and move an 
unstable object on the edge to its nearest representative, thus forming a schedule 
of movements for the unstable pivots. 

We experiment with two heuristics. The first is a random heuristic in which 
a random edge is selected from those edges that originate from a surplus cluster; 
the second is a look-ahead-l heuristic which looks ahead at all possible 
movement paths originating from a surplus cluster, of length up to l, and selects 
the best among them. The selected movement path is then activated, resulting 
in a movement of up to l objects depending on the length of the path. Since 
there are at most k{k — 1) edges, there are at most 0{k{k — 1)*"'’^) movement 
paths. While there exist optimization strategies that can avoid examining all 
the qualifying movement paths of length I, the worst case complexity of this 
heuristic remains 0{k{k — 1)*"'’^). Thus, the value of I is designed to be a small 
integer. 
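
The look-ahead idea can be sketched in Python as an exhaustive search over movement paths of bounded length. This is only an illustration under simplifying assumptions that are not taken from the paper: each edge offers its single most profitable object, an edge is used at most once per path, and the function name `best_movement_path` and the toy graph are hypothetical. The random heuristic corresponds to picking one arbitrary edge that leaves a surplus cluster instead of searching.

```python
def best_movement_path(graph, surplus, l):
    """Return (total_gain, moves) for the best movement path of length <= l.

    `graph` maps (src, dst) to a list of (obj, gain) labels, `surplus` is the
    set of clusters a path may start from, and each move is (obj, src, dst).
    """
    # Keep only the most profitable unstable object on every edge.
    out_edges = {}
    for (src, dst), labels in graph.items():
        if labels:
            obj, gain = max(labels, key=lambda item: item[1])
            out_edges.setdefault(src, []).append((dst, obj, gain))

    best = (0.0, [])

    def extend(cluster, moves, total, used_edges):
        nonlocal best
        if total > best[0]:
            best = (total, list(moves))
        if len(moves) == l:  # look-ahead depth reached
            return
        for dst, obj, gain in out_edges.get(cluster, []):
            edge = (cluster, dst)
            if edge in used_edges:
                continue
            moves.append((obj, cluster, dst))
            # The receiving cluster gains one object, so it may pass one on.
            extend(dst, moves, total + gain, used_edges | {edge})
            moves.pop()

    for start in surplus:
        extend(start, [], 0.0, frozenset())
    return best


# Hypothetical toy graph: cluster 1 is the only surplus cluster.
toy_graph = {
    (1, 2): [("O3", 4.0)],
    (2, 3): [("O9", 2.5), ("O8", 1.0)],
    (1, 3): [("O5", 5.0)],
}
```

On this toy graph a depth of 1 picks the single move of O5, while a depth of 2 prefers the two-step path O3 then O9, which shows how a greedy single-edge choice can miss a better path.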

Using these heuristics, our corresponding movement in SQ will eventually 
reach a node CC" where no further movement is possible. We then repeat 
the process and form a new subgraph SQ for processing. 



5.2 Handling Tight Existential Constraints 

While the cluster refinement algorithm discussed earlier works well under most 
constraints, a problem arises when the constraint EC is tight, i.e., when it is 
nearly impossible to satisfy. For example, given k = 5, |D| = 100 and 
EC = {count(Cli) ≥ 20}, 1 ≤ i ≤ 5, our algorithm may get into a deadlock 
cycle. A sequence of clusters (Cl1, ..., Clk, Cl1) is said to be in a deadlock cycle 
of length k if (a) all the clusters are non-surplus; and (b) there is an edge in the 
pivot movement graph from Cli to Cli+1, 1 ≤ i ≤ k − 1, and one from Clk to 
Cl1, respectively. 

In terms of the graph, SQ, a tight EC means that SQ contains a large 
number of invalid nodes and refining the clusters by movement through only 
valid nodes is not possible. In view of this, a deadlock resolution phase is added 
before computing a new subgraph SQ. The objective of the deadlock resolution 
phase is to provide a mechanism to jump over a set of invalid nodes by resolving 
deadlock in the pivot movement graph. Similar to the PM problem, we can prove 
(in a way similar to that for Theorem 4) that resolving deadlock optimally is 
NP hard. 

Similarly, we can show that there is also no constant factor approximation 
algorithm for the deadlock resolution problem which runs in polynomial time. 
Thus, we resort to the following heuristic based on a randomized strategy. It 
conducts a depth-first search on the pivot movement graph to find any deadlock 
cycle. Suppose the deadlock cycle detected is (Cl1, ..., Clk, Cl1). Let ni denote 
the number of unstable pivot objects appearing as labels on the edge from Cli 
to Cli+1 (with Clk+1 taken to be Cl1). Then let nmax denote the minimum ni value 
among the edges in the cycle, i.e., nmax = min{ni | 1 ≤ i ≤ k}. This marks the 
maximum number of unstable objects that can be moved across the entire cycle 
without violating EC. Once the nmax value has been determined, the heuristic 
moves the nmax unstable pivot objects with the highest reduction in DISP across 
each edge of the cycle, causing the cycle to be broken. 
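
The deadlock resolution step described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the edge-label layout (lists of (object, gain) pairs) and the toy data are assumptions, and the caller is assumed to pass only edges between non-surplus clusters.

```python
def find_cycle(edges):
    """Depth-first search for a directed cycle; returns [c1, ..., ck] or None."""
    adjacency = {}
    for (src, dst), labels in edges.items():
        if labels:
            adjacency.setdefault(src, []).append(dst)

    state = {}   # cluster -> "active" (on the DFS stack) or "done"
    stack = []

    def dfs(u):
        state[u] = "active"
        stack.append(u)
        for v in adjacency.get(u, []):
            if state.get(v) == "active":      # back edge closes a cycle
                return stack[stack.index(v):]
            if v not in state:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        state[u] = "done"
        return None

    for start in list(adjacency):
        if start not in state:
            cycle = dfs(start)
            if cycle:
                return cycle
    return None


def resolve_deadlock(edges, cycle):
    """Move the n_max best objects across every edge of the cycle."""
    k = len(cycle)
    cycle_edges = [(cycle[i], cycle[(i + 1) % k]) for i in range(k)]
    n_max = min(len(edges[e]) for e in cycle_edges)
    moves = []
    for src, dst in cycle_edges:
        best = sorted(edges[(src, dst)], key=lambda item: item[1], reverse=True)
        moves.extend((obj, src, dst) for obj, _ in best[:n_max])
    return n_max, moves


# Hypothetical deadlock among three non-surplus clusters.
deadlock_edges = {
    (1, 2): [("a", 3.0), ("b", 1.0)],
    (2, 3): [("c", 2.0)],
    (3, 1): [("d", 5.0), ("e", 4.0), ("f", 0.5)],
}
```

Because every cluster on the cycle loses and gains the same number of objects, the cluster sizes are unchanged after the moves, which is why the constraint still holds once the whole cycle has been traversed.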

5.3 Local Optimality and Termination 

Having introduced our algorithm, we will now look at its formal properties by 
analyzing the two main phases of the algorithm: pivot movement and deadlock 
resolution. Our algorithm essentially iterates through these two phases and com- 
putes a new subgraph SQ at the end of each iteration. 

Local Optimality Result. Having modeled our cluster refinement algorithm 
as a graph search, we would like to establish that at the end of each iteration, 
the clustering obtained corresponds to a local minimum in the subgraph SQ. 
However, since the dispersion of every node in SQ is computed with respect 
to the cluster representatives of CC, when there is a pivot movement, say object 
p moved from Cli to Clj, both the representative of Cli and that of Clj change, 
and the set of unstable pivots S can also change, which means that SQ itself 
must be recomputed. This process is time-consuming, especially for our look- 
ahead heuristic, which must recompute SQ at every step it looks ahead. Because 
of this, we choose to freeze the representatives of each cluster and avoid the 
recomputation of SQ. As such, the cost of each node CL in the subgraph SQ is 
not the true dispersion but rather the “approximated” dispersion, denoted as 
disp(CL), relative to the fixed representatives. Now we can establish the following 
result. Intuitively, at the end of the pivot movement phase, no surplus cluster in 
the pivot movement graph has an outgoing edge. Thus, it is not possible to find 
a valid neighbor of the current one that has a lower dispersion. 

Lemma 1. The clustering obtained at the end of the pivot movement phase 
is a local minimum in the subgraph SQ, where cost is based on approximated 
dispersion disp(CL). □ 

Interestingly, a deadlock cycle of length k corresponds to a path (CL1, ..., 
CLk+1) in SQ, such that the first node/clustering CL1 and the last node CLk+1 
are valid, but all the other nodes are not. This is a very interesting phenomenon 
because resolving a deadlock cycle amounts to jumping from one valid node to 
another via a sequence of invalid nodes in SQ. In particular, if deadlock cycles 
are resolved after the pivot movement phase as in our algorithm, then we jump 
from a valid local minimum to another (which is not a neighbor) with a strictly 
lower value of dispersion. 

Lemma 2. The clustering obtained at the end of the deadlock resolution phase 
is a local minimum in the subgraph SQ, where cost is based on approximated 
dispersion disp(CL). □ 

Termination of the Algorithm. Since each move in the graph SQ corresponds 
to a reduction in the number of unstable pivot objects, and the number of 
unstable pivot objects is finite, both the pivot movement phase and the deadlock 
resolution phase will terminate. Moreover, since we move to a node of lower 
DISP in every iteration, and the clustering space is finite, it is impossible for 
the DISP value to decrease forever. Thus, the algorithm terminates. 



6 Scaling the Algorithm by Micro-Cluster Sharing 

For clustering large, disk-resident databases, many studies have adopted a micro- 
clustering methodology (e.g., [ZRL96, WYM97, BFR98, KHK99]), which “compresses” 
data objects into micro-clusters in a pre-clustering phase so that the 
subsequent clustering activities can be accomplished at the micro-cluster level. 
To ensure that not much quality is lost, a maximum radius on a micro-cluster is 
imposed. 

With micro-clustering, instead of moving one unstable object at a time across 
the edges of a pivot movement graph, our cluster refinement has to move 
one micro-cluster. However, since each micro-cluster can contain more than one 
pivot object, it may not be possible to move a micro-cluster away from a surplus 
cluster without invalidating the constraint. A similar complication arises when 
resolving deadlock since there is no guarantee that, for each edge in a cycle, the 
total number of pivot objects in the micro-clusters to be moved adds up to exactly 
nmax. 




Fig. 3. An Example of Micro-cluster Sharing 



To resolve these problems, we introduce a novel concept called micro-cluster 
sharing. Given a micro-cluster with n non-pivot objects and m pivot objects, 
the n non-pivot objects will always be allocated to the nearest cluster, while the 
m pivot objects can be shared among multiple clusters. For example, consider 
Figure 3 in which micro-cluster mc1 is formed from 5 non-pivot objects and 6 
pivot objects. It is shared by three clusters, Cl1, Cl2 and Cl3. Since Cl2 is the 
nearest to mc1, it owns all 5 of mc1’s non-pivot objects and also 2 pivot objects 
from mc1. Cl1, on the other hand, contains 3 pivot objects from mc1, while Cl3 
has 1 pivot object from mc1. 

To record the sharing or “splitting” of mc1 into multiple clusters, we use 
the notation Cli.mc1 to represent the part of mc1 that is in Cli. During the 
pivot movement and deadlock resolution phases, if p objects of Cli.mc1 are to 
be moved to Clj, the algorithm calls a function MovePivot(Cli, Clj, mc1, p) 
which updates the numbers in Cli.mc1 and Clj.mc1 accordingly. In Figure 3, 
MovePivot(Cl1, Cl3, mc1, 1) moves one pivot object from Cl1.mc1 to Cl3.mc1. 
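
The bookkeeping behind MovePivot can be sketched in a few lines of Python. The paper only names the function and its arguments; the dictionary layout, the function name `move_pivot` and the error handling below are assumptions made for illustration. The initial numbers are those of the Figure 3 example (3, 2 and 1 pivot objects of mc1 held by Cl1, Cl2 and Cl3).

```python
def move_pivot(shares, src, dst, mc, p):
    """Move p pivot objects of micro-cluster `mc` from cluster `src` to `dst`.

    `shares[(cluster, mc)]` plays the role of the counter written Cl.mc in the
    text: the number of pivot objects of `mc` currently held by the cluster.
    """
    available = shares.get((src, mc), 0)
    if p < 0 or p > available:
        raise ValueError(f"{src}.{mc} holds {available} pivot objects, cannot move {p}")
    shares[(src, mc)] = available - p
    shares[(dst, mc)] = shares.get((dst, mc), 0) + p


# Shares of mc1 as described for Figure 3.
shares = {("Cl1", "mc1"): 3, ("Cl2", "mc1"): 2, ("Cl3", "mc1"): 1}
move_pivot(shares, "Cl1", "Cl3", "mc1", 1)
```

Only counters change at this stage; which concrete objects end up in which cluster is decided once, at the end of clustering.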

Given the MovePivot() function, the problem of being unable to shift micro- 
clusters around the clusters is effectively solved since the micro-clusters can now 
be dynamically split and combined to cater to the condition for swapping. Since 
the number of objects in a micro-cluster is small enough for all of them to fit in 
main memory, the above heuristic requires a minimum amount of I/O. 

The remaining issue that we need to address is how, at the end of clustering, 
to determine the actual objects in a micro-cluster mc that are to be assigned to 
Cl1, ..., Clq, where these are all the clusters for which Cli.mc is positive. We 
adopt the following greedy heuristic. All the non-pivot objects in mc 
are assigned to the nearest center/cluster. This is to reduce DISP as much 
as possible. Consider the set of distances defined as: {df(O, Cli) | O is a pivot 
object in mc, and 1 ≤ i ≤ q}. Sort this set of distances in ascending order. Based 
on this order, the pivot objects are assigned to the cluster as near as possible, 
while satisfying the numbers recorded in Cl1.mc, ..., Clq.mc. 
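
The greedy assignment of pivot objects can be sketched as below. This is an illustration under assumptions: Euclidean distance stands in for the distance function df, the quotas are the recorded Cli.mc numbers and are assumed to sum to the number of pivot objects, and all names and coordinates are hypothetical.

```python
import math


def assign_pivots(pivots, centers, quotas):
    """Greedily assign pivot objects of one micro-cluster to clusters.

    pivots:  object id -> coordinates
    centers: cluster id -> coordinates of its representative
    quotas:  cluster id -> number of pivot objects it must receive (Cl.mc)
    """
    # All (distance, object, cluster) pairs in ascending order of distance.
    candidates = sorted(
        (math.dist(position, centers[cluster]), obj, cluster)
        for obj, position in pivots.items()
        for cluster in quotas
    )
    remaining = dict(quotas)
    assignment = {}
    for _, obj, cluster in candidates:
        if obj not in assignment and remaining[cluster] > 0:
            assignment[obj] = cluster
            remaining[cluster] -= 1
    return assignment


# Hypothetical micro-cluster with three pivot objects shared by two clusters.
centers = {"Cl1": (0.0, 0.0), "Cl2": (10.0, 0.0)}
quotas = {"Cl1": 1, "Cl2": 2}
pivots = {"a": (1.0, 0.0), "b": (2.0, 0.0), "c": (9.0, 0.0)}
assignment = assign_pivots(pivots, centers, quotas)
```

Here object b is closer to Cl1, but Cl1's quota is already used by a, so b is sent to Cl2; this is the price paid for honoring the recorded numbers.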



7 Performance Analysis 

We report our performance study, which evaluates the efficiency and effectiveness 
of the proposed heuristics. All the experiments were performed on a 450 MHz Intel 
Celeron, with 64 MB of main memory and an IBM 7200 rpm disk drive. 

Two datasets were used in our experiments. The first, DS1, is a dataset of a 
courier company for planning collection centers based on the locations of their 
frequent customers (see [TNLH00]). The second, DS2, is synthetic, generated 
following the synthetic datasets used in [ZRL96]. For lack of space, we report 
our experiments on DS2 only. 

For the constraints, we made all data objects pivot objects in order to 
subject our algorithms to the most rigorous tests. Micro-clustering in our experiments 
was done using the CF-tree in the BIRCH algorithm [ZRL96], which only needs 
to scan through the database once. Note that BIRCH is used here as a 
preprocessing step and its data structure is not utilized in any part of our algorithm. 

To distinguish the different heuristics used in our algorithms, we denote an 
algorithm as RandLS if it uses the random heuristic in pivot movement and 
LAHLS-l if it uses the look-ahead-l heuristic. For the micro-cluster sharing 
version of the two algorithms, the term “Micro” is added, i.e., MicroRandLS and 
MicroLAHLS-l. 

Our synthetic dataset was generated using a modification of the synthetic 
dataset from [ZRL96], with a skewed density distribution for testing the scalability 
of our algorithms and how constraints affect the clustering of a dataset. 
Essentially, there was an M × M grid in which cluster centers were placed. The distance 
between neighboring centers in the same row or column was set to 1. For a 
cluster centered at the coordinate (row, column), ((row − 1) × M + column) × 50 
points were generated following a 2-d normal distribution with center at (row, 
column) and variance 0.5². By design, the density of the synthetic data was 
skewed, being least dense at the top-left corner of the grid and most dense at 
the bottom-right. The value of M was varied from 4 to 8, generating datasets 
with various sizes and numbers of clusters. Figure 5 shows the parameters used 
for each dataset. 
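
The generation scheme above can be sketched with the Python standard library. This is an illustrative reading of the description, not the authors' generator: the function names are hypothetical, and the two coordinates are assumed to be independent normals with standard deviation 0.5 (i.e. variance 0.5²).

```python
import random


def cluster_sizes(M, points_per_unit=50):
    """Number of points for the cluster centered at (row, column), 1-based."""
    return {
        (row, col): ((row - 1) * M + col) * points_per_unit
        for row in range(1, M + 1)
        for col in range(1, M + 1)
    }


def generate_dataset(M, seed=0):
    """Draw all points of the M x M skewed-density dataset."""
    rng = random.Random(seed)
    points = []
    for (row, col), n in cluster_sizes(M).items():
        # Assumption: independent coordinates, standard deviation 0.5 each.
        points.extend((rng.gauss(row, 0.5), rng.gauss(col, 0.5)) for _ in range(n))
    return points


# Total dataset size |D| for every grid size used in the experiments.
totals = {M: sum(cluster_sizes(M).values()) for M in range(4, 9)}
```

The totals computed this way are 6800, 16250, 33300, 61250 and 104000 for M = 4, ..., 8, which agrees with the |D| column of Figure 5.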

As shown in Figure 6(a), the order of effectiveness of the various algorithms 
generally remains unchanged with LAHLS giving the best quality of clustering. 
The running times of both MicroRandLS and MicroLAHLS-4 remain relatively 
low as both |D| and k increase. For a dataset with 104000 tuples, the running 
times of MicroRandLS and MicroLAHLS-4 were around 1100 and 1500 seconds 
respectively. 

To see the effect that an existential constraint has on the clustering, we look 
at the output of MicroLAHLS-4 in Figure 4, which shows a synthetic dataset with 
M = 8. The clustering was done with the existential constraint “count(Cli) ≥ 
812” imposed. Since the actual clusters that were generated near the top-left 
corner generally contained fewer than 812 points, there was a shift of cluster 
centers from the top-left corner towards the dense region at the bottom-right 
corner. 

To summarize, our experimental results show that our algorithm is 
effective for constrained clustering. Among the heuristics, micro-cluster sharing 
clearly delivers good efficiency and scalability. The gain in efficiency far offsets 
the small loss in quality. Finally, the look-ahead heuristic with small l (e.g., 4) 
appears to be the best candidate for pivot movement. 




 M     k      |D|      c    No. of Micro-clusters 
 4    16     6800    212    1410 
 5    25    16250    325    2523 
 6    36    33300    462    4050 
 7    49    61250    625    7079 
 8    64   104000    812    8253 

Fig. 4. 64 Clusters with ≥ 812 Objects Each        Fig. 5. Parameters for Synthetic Dataset 



8 Discussion: Handling Other SQL Aggregate Constraints 



We have presented a cluster refinement algorithm which handles a single 
existential constraint. In this section, we first examine how the algorithm can be 
extended to handle constraints containing a single SQL aggregate, where the constraints 
are classified into five classes (see Table 1) based on their behavior with respect 
to constrained clustering: existential, existential-like, universal, averaging, and 
summation. 

Fig. 6. Performance of Various Algorithms as |D| Varies: (a) average DISP per object against |D|; (b) running time against |D|. 



Table 1. A Classification of SQL Constraints 

         < or ≤             ≠                  =                  > or ≥ 
min      existential        existential-like   existential-like   universal 
max      universal          existential-like   existential-like   existential 
count    existential-like   existential-like   existential        existential 
avg      averaging          averaging          averaging          averaging 
sum      summation          summation          summation          summation 


1. Universal constraints: These are constraints in which a specific condition 
must be satisfied by every object in a cluster. For example, min({Oi[Aj] | Oi ∈ 
Cli}) ≥ c requires that every object’s Aj-value be ≥ c. This can be reduced to 
the UC problem as discussed for constraints on individual objects in Section 2. 

2. Existential-like constraints: These constraints are similar in nature to 
existential constraints, and our algorithm can handle them with simple 
modifications. For example, count(Cli) ≤ c is an existential-like constraint. Instead of 
moving surplus objects around, the objective here is to move “holes” around to 
achieve the maximum reduction in DISP. If a cluster Cli contains m objects, 
m ≤ c, it has c − m holes, meaning that c − m objects can still be moved into it. 
When an object is moved from Clj into Cli, a hole is moved from Cli into Clj. 
Correspondingly, a hole movement graph can be generated which could be used 
to guide movement in the locality graph. 

3. Averaging and Summation constraints. For these kinds of constraints, 
even computing an initial solution is an NP-hard problem similar to a bin- 
packing or knapsack problem [GJ79]. Handling general averaging and summation 
constraints in clustering is an interesting problem for future work. 
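
For the existential-like constraint count(Cli) ≤ c of item 2, the notion of holes can be made concrete with a small Python sketch. The function names and the example counts are hypothetical; the code only mirrors the bookkeeping described in the text (a cluster with m ≤ c objects has c − m holes, and moving an object one way moves a hole the other way).

```python
def holes(counts, c):
    """Number of holes per cluster under the constraint count(Cl) <= c."""
    return {cluster: c - m for cluster, m in counts.items()}


def move_object(counts, src, dst, c):
    """Move one object from `src` into `dst`, which must still have a hole."""
    if counts[dst] >= c:
        raise ValueError(f"{dst} has no hole left under count <= {c}")
    counts[src] -= 1
    counts[dst] += 1


# Hypothetical clusters under count(Cl) <= 10.
counts = {"Cl1": 8, "Cl2": 10}
before = holes(counts, 10)
move_object(counts, "Cl2", "Cl1", 10)   # the object goes Cl2 -> Cl1
after = holes(counts, 10)               # so a hole goes Cl1 -> Cl2
```

The total number of holes is unchanged by a move, just as the total number of pivot objects is unchanged in the pivot movement graph.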

Finally, we consider the situation when there are multiple conjunctive 
existential constraints. The local search algorithm can easily be modified to handle 
existential constraints when the sets of pivot objects for these constraints do not 
overlap. The algorithm can then set up a different pivot movement graph for 
each constraint, and move the pivot objects in different graphs independently. 
However, for situations where the sets of pivot objects do overlap, again we can 
show even computing an initial solution is NP-hard. Handling multiple general 
existential constraints is another interesting problem for future work. 



9 Conclusions 



In this paper, we introduced and studied the constrained clustering problem, a 
problem which arises naturally in practice but has barely been addressed before. A 
(constrained) cluster refinement algorithm is developed, which includes two phases 
of movement in a clustering locality graph: pivot movement and deadlock 
resolution. Our experimental results show that both phases are valuable. To scale 
up the algorithm for large databases, we proposed a micro-cluster sharing 
strategy whose effectiveness is also verified by our experiments. Our algorithm can 
also be extended to handle some other kinds of constraints. However, handling 
general averaging and summation constraints, as well as handling general 
multiple existential constraints, are interesting topics for future research. Thanks to 
a reviewer of this paper, we have come to know of a recent study by Bradley 
et al. [BBD00] on a version of the constrained clustering problem similar to ours. 
Unlike us, their main motivation is using cardinality constraints for better 
quality clustering. Scalability and applicability to other types of constraints are not 
addressed. Despite these differences, a quantitative comparison between the two 
approaches would be interesting future work. 

Abstract. In recent years, the effect of the curse of high dimensionality 
has been studied in great detail on several problems such as clustering, 
nearest neighbor search, and indexing. In high dimensional space the data 
becomes sparse, and traditional indexing and algorithmic techniques fail 
from an efficiency and/or effectiveness perspective. Recent research results 
show that in high dimensional space, the concept of proximity, distance 
or nearest neighbor may not even be qualitatively meaningful. In this 
paper, we view the dimensionality curse from the point of view of the 
distance metrics which are used to measure the similarity between objects. 
We specifically examine the behavior of the commonly used Lk norm 
and show that the problem of meaningfulness in high dimensionality is 
sensitive to the value of k. For example, this means that the Manhattan 
distance metric (L1 norm) is consistently preferable to the 
Euclidean distance metric (L2 norm) for high dimensional data mining 
applications. Using the intuition derived from our analysis, we introduce 
and examine a natural extension of the Lk norm to fractional distance 
metrics. We show that the fractional distance metric provides more 
meaningful results both from the theoretical and empirical perspective. The 
results show that fractional distance metrics can significantly improve 
the effectiveness of standard clustering algorithms such as the k-means 
algorithm. 



1 Introduction 

In recent years, high dimensional search and retrieval have become very well 
studied problems because of the increased importance of data mining applica- 
tions [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which 
require the use of such techniques comprise very high dimensional data. For such 
applications, the curse of high dimensionality tends to be a major obstacle in the 
development of data mining techniques in several ways. For example, the per- 
formance of similarity indexing structures in high dimensions degrades rapidly, 
so that each query requires the access of almost all the data [1]. 

It has been argued in [6] that under certain reasonable assumptions on the 
data distribution, the ratio of the distances of the nearest and farthest neighbors 
to a given target in high dimensional space is almost 1 for a wide variety of data 
distributions and distance functions. In such a case, the nearest neighbor problem 
becomes ill defined, since the contrast between the distances to different data 
points does not exist. In such cases, even the concept of proximity may not 
be meaningful from a qualitative perspective: a problem which is even more 
fundamental than the performance degradation of high dimensional algorithms. 

In most high dimensional applications the choice of the distance metric is 
not obvious, and the notion for the calculation of similarity is largely heuristic. 
Given the non-contrasting nature of the distribution of distances to a given 
query point, different measures may provide very different orders of proximity 
of points to a given query point. There is very little literature on providing 
guidance for choosing the correct distance measure which results in the most 
meaningful notion of proximity between two records. Many high dimensional 
indexing structures and algorithms use the Euclidean distance metric as a natural 
extension of its traditional use in two- or three-dimensional spatial applications. 
In this paper, we discuss the general behavior of the commonly used $L_k$ norm 
($x, y \in \mathcal{R}^d$, $k \in \mathcal{Z}$, $L_k(x, y) = \left(\sum_{i=1}^{d} |x^i - y^i|^k\right)^{1/k}$) in high dimensional space. 

The Lk norm distance function is also susceptible to the dimensionality curse 
for many classes of data distributions [6]. Our recent results [9] seem to suggest 
that the Lk norm may be more relevant for k = 1 or 2 than for values of k ≥ 3. In 
this paper, we provide some surprising theoretical and experimental results in 
analyzing the dependency of the Lk norm on the value of k. More specifically, 
we show that the relative contrasts of the distances to a query point depend 
heavily on the Lk metric used. This provides considerable evidence that the 
meaningfulness of the Lk norm worsens faster with increasing dimensionality for 
higher values of k. Thus, for a given problem with a fixed (high) value of the 
dimensionality d, it may be preferable to use lower values of k. This means that 
the L1 distance metric (Manhattan distance metric) is the most preferable for 
high dimensional applications, followed by the Euclidean metric (L2), then the 
L3 metric, and so on. Encouraged by this trend, we examine the behavior of 
fractional distance metrics, in which k is allowed to be a fraction smaller than 1. 
We show that this metric is even more effective at preserving the meaningfulness 
of proximity measures. We back up our theoretical results with empirical tests on 
real and synthetic data showing that the results provided by fractional distance 
metrics are indeed practically useful. Thus, the results of this paper have strong 
implications for the choice of distance metrics for high dimensional data mining 
problems. We specifically show the improvements which can be obtained by 
applying fractional distance metrics to the standard k-means algorithm. 
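
For concreteness, the Lk distance, including the fractional case 0 < k < 1, can be computed as in the following Python sketch. The function name `lk_distance` is an assumption made for this example; the formula is the standard one, (Σ |x^i − y^i|^k)^(1/k).

```python
def lk_distance(x, y, k):
    """L_k distance between two equal-length sequences of coordinates.

    k = 1 gives the Manhattan distance, k = 2 the Euclidean distance, and
    0 < k < 1 gives a fractional distance metric.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


origin, point = (0.0, 0.0), (3.0, 4.0)
manhattan = lk_distance(origin, point, 1)
euclidean = lk_distance(origin, point, 2)
fractional = lk_distance(origin, point, 0.5)
```

For the same pair of points the value grows as k decreases, so distances computed with different k are not directly comparable; what matters in the analysis is the contrast among distances under one fixed k.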

This paper is organized as follows. In the next section, we provide a 
theoretical analysis of the behavior of the Lk norm in very high dimensionality. In 
section 3, we discuss fractional distance metrics and provide a theoretical 
analysis of their behavior. In section 4, we provide the empirical results, and section 
5 provides summary and conclusions. 



2 Behavior of the Lk-Norm in High Dimensionality 

In order to present our convergence results, we first establish some notations and 
definitions in Table 1. 

Table 1. Notations and Basic Definitions 

Notation                               Definition 
$d$                                    Dimensionality of the data space 
$N$                                    Number of data points 
$\mathcal{F}$                          1-dimensional data distribution in (0, 1) 
$X_d$                                  Data point from $\mathcal{F}^d$ with each coordinate drawn from $\mathcal{F}$ 
$dist_d^k(x, y)$                       Distance between $(x^1, \ldots, x^d)$ and $(y^1, \ldots, y^d)$ 
                                       using the $L_k$ metric $= \left(\sum_{i=1}^{d} |x^i - y^i|^k\right)^{1/k}$ 
$\| \cdot \|_k$                        Distance of a vector to the origin $(0, \ldots, 0)$ 
                                       using the function $dist_d^k(\cdot, \cdot)$ 
$Dmax_d^k = \max\{\|X_d\|_k\}$         Farthest distance of the $N$ points 
                                       to the origin using the distance metric $L_k$ 
$Dmin_d^k = \min\{\|X_d\|_k\}$         Nearest distance of the $N$ points 
                                       to the origin using the distance metric $L_k$ 
$E[X]$, $var[X]$                       Expected value and variance of a random variable $X$ 
$Y_d \to_p c$                          A vector sequence $Y_1, \ldots, Y_d$ converges in probability to a 
                                       constant vector $c$ if: $\forall \epsilon > 0 \; \lim_{d \to \infty} P[dist_d(Y_d, c) \le \epsilon] = 1$ 



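Theorem 1. Beyer et al. (Adapted for the $L_k$ metric) 

If $\lim_{d \to \infty} var\left(\frac{\|X_d\|_k}{E[\|X_d\|_k]}\right) = 0$, then $\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k} \to_p 0$. 

Proof. See [6] for proof of a more general version of this result. 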



The result of the theorem [6] shows that the difference between the maximum 
and minimum distances to a given query point¹ does not increase as fast 
as the nearest distance to any point in high dimensional space. This makes a 
proximity query meaningless and unstable because there is poor discrimination 
between the nearest and furthest neighbor. Henceforth, we will refer to the ratio 
$\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k}$ as the relative contrast. 

The results in [6] use the value of $\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k}$ as an interesting criterion 
for meaningfulness. In order to provide more insight, in the following we analyze 
the behavior for different distance metrics in high-dimensional space. We first 
assume a uniform distribution of data points and show our results for N = 2 
points. Then, we generalize the results to an arbitrary number of points and 
arbitrary distributions. 



¹ In this paper, we consistently use the origin as the query point. This choice does not 
affect the generality of our results, though it simplifies our algebra considerably. 

Lemma 1. Let $\mathcal{F}$ be the uniform distribution of $N = 2$ points. For an $L_k$ metric, 

\[ \lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] = C \cdot \frac{1}{(k+1)^{1/k}} \sqrt{\frac{1}{2k+1}}, \] 

where $C$ is some constant. 


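Proof. Let $A_d$ and $B_d$ be the two points in a $d$-dimensional data distribution 
such that each coordinate is independently drawn from a 1-dimensional 
data distribution $\mathcal{F}$ with finite mean and standard deviation. Specifically, 
$A_d = (P_1, \ldots, P_d)$ and $B_d = (Q_1, \ldots, Q_d)$ with $P_i$ and $Q_i$ being drawn from $\mathcal{F}$. Let 
$PA_d = \left\{\sum_{i=1}^{d} (P_i)^k\right\}^{1/k}$ be the distance of $A_d$ to the origin using the $L_k$ metric 
and $PB_d = \left\{\sum_{i=1}^{d} (Q_i)^k\right\}^{1/k}$ the distance of $B_d$. The difference of distances is 
$PA_d - PB_d = \left\{\sum_{i=1}^{d} (P_i)^k\right\}^{1/k} - \left\{\sum_{i=1}^{d} (Q_i)^k\right\}^{1/k}$. 

It can be shown² that the random variable $(P_i)^k$ has mean $\frac{1}{k+1}$ and standard 
deviation $\left(\frac{k}{k+1}\right) \sqrt{\frac{1}{2k+1}}$. This means that 

\[ \frac{(PA_d)^k}{d} \to_p \frac{1}{k+1}, \qquad \frac{(PB_d)^k}{d} \to_p \frac{1}{k+1} \] 

and therefore 

\[ \frac{PA_d}{d^{1/k}} \to_p \left(\frac{1}{k+1}\right)^{1/k}, \qquad \frac{PB_d}{d^{1/k}} \to_p \left(\frac{1}{k+1}\right)^{1/k}. \tag{1} \] 

We intend to characterize the limiting behavior of $\frac{|PA_d - PB_d|}{d^{1/k - 1/2}}$. We can express 
$|PA_d - PB_d|$ in the following numerator/denominator form, which we will use in 
order to examine the convergence behavior of the numerator and denominator 
individually: 

\[ |PA_d - PB_d| = \frac{|(PA_d)^k - (PB_d)^k|}{\sum_{r=0}^{k-1} (PA_d)^{k-r-1} (PB_d)^r}. \tag{2} \] 

Dividing both sides by $d^{1/k - 1/2}$ and regrouping the right-hand side we get: 

\[ \frac{|PA_d - PB_d|}{d^{1/k - 1/2}} = \frac{|(PA_d)^k - (PB_d)^k| / \sqrt{d}}{\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^r}. \tag{3} \] 

Consequently, using Slutsky’s theorem³ and the results of Equation 1 we obtain 

\[ \sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^r \to_p k \cdot \left(\frac{1}{k+1}\right)^{(k-1)/k}. \tag{4} \] 

Having characterized the convergence behavior of the denominator of the right 
hand side of Equation 3, let us now examine the behavior of the numerator: 
$|(PA_d)^k - (PB_d)^k| / \sqrt{d} = \left|\sum_{i=1}^{d} (P_i)^k - \sum_{i=1}^{d} (Q_i)^k\right| / \sqrt{d} = \left|\sum_{i=1}^{d} R_i\right| / \sqrt{d}$. Here 
$R_i$ is the new random variable defined by $((P_i)^k - (Q_i)^k)$ $\forall i \in \{1, \ldots, d\}$. This 
random variable has zero mean and standard deviation $\sqrt{2} \cdot \sigma$, where 
$\sigma$ is the standard deviation of $(P_i)^k$. The sum of different values of $R_i$ over $d$ 
dimensions will converge to a normal distribution with mean 0 and standard 
deviation $\sqrt{2} \cdot \sigma \cdot \sqrt{d}$ because of the central limit theorem. Consequently, the 
mean average deviation of this distribution will be $C \cdot \sigma$ for some constant $C$. 
Therefore, we have: 

\[ \lim_{d \to \infty} E\left[\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}}\right] = C \cdot \sigma = C \cdot \left(\frac{k}{k+1}\right) \sqrt{\frac{1}{2k+1}}. \tag{5} \] 

Since the denominator of Equation 3 shows probabilistic convergence, we can 
combine the results of Equations 4 and 5 to obtain 

\[ \lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k - 1/2}}\right] = C \cdot \frac{1}{(k+1)^{1/k}} \sqrt{\frac{1}{2k+1}}, \tag{6} \] 

where the factor $1/k$ from Equation 4 is absorbed into the constant $C$. 

² This is because $E[P_i^k] = 1/(k+1)$ and $E[P_i^{2k}] = 1/(2 \cdot k + 1)$. 

³ Slutsky’s Theorem: Let $Y_1 \ldots Y_d \ldots$ be a sequence of random vectors and $h(\cdot)$ be 
a continuous function. If $Y_d \to_p c$ then $h(Y_d) \to_p h(c)$. 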



We can easily generalize the result for a database of N uniformly distributed 
points. The following Corollary provides the result. 



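Corollary 1. Let $\mathcal{F}$ be the uniform distribution of $N = n$ points. Then, 

\[ \frac{C}{(k+1)^{1/k}} \sqrt{\frac{1}{2k+1}} \le \lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] \le \frac{C \cdot (n-1)}{(k+1)^{1/k}} \sqrt{\frac{1}{2k+1}}. \] 

Proof. This is because if $L$ is the expected difference between the maximum and 
minimum of two randomly drawn points, then the same value for $n$ points drawn 
from the same distribution must be in the range $(L, (n-1) \cdot L)$. 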



The results can be modified for arbitrary distributions of N points in a 
database by introducing the constant factor $C_k$. In that case, the general dependency 
of $Dmax - Dmin$ on $d^{1/k - 1/2}$ remains unchanged. A detailed proof is provided in 
the Appendix; a short outline of the reasoning behind the result is available in 
[9]. 



Lemma 2. [9] Let $\mathcal{F}$ be an arbitrary distribution of $N = 2$ points. Then, 

\[ \lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] = C_k, \] 

where $C_k$ is some constant dependent on $k$. 

Corollary 2. Let $\mathcal{F}$ be an arbitrary distribution of $N = n$ points. Then, 

\[ C_k \le \lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] \le (n-1) \cdot C_k. \] 



Thus, this result shows that in high dimensional space $Dmax_d^k - Dmin_d^k$ 
increases at the rate of $d^{1/k - 1/2}$, independent of the data distribution. This means 
that for the Manhattan distance metric, the value of this expression diverges to 
∞; for the Euclidean distance metric, the expression is bounded by constants, 
whereas for all other distance metrics (k ≥ 3), it converges to 0 (see Figure 1). 
Furthermore, the convergence is faster when the value of k of the Lk metric increases. 
This provides the insight that higher norm parameters provide poorer contrast 
between the furthest and nearest neighbor. Even more insight may be 
obtained by examining the exact behavior of the relative contrast as opposed to the 
absolute distance between the furthest and nearest point. 
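
The d^(1/k − 1/2) growth rate can be observed with a small Monte Carlo experiment. The sketch below is an illustration only: the function names, the number of trials and the chosen dimensionalities are assumptions, the data are uniform on (0, 1), and the origin is used as the query point as in the text.

```python
import random


def lk_norm(point, k):
    """L_k distance of a point to the origin."""
    return sum(abs(c) ** k for c in point) ** (1.0 / k)


def mean_abs_contrast(d, k, n_points=2, trials=200, seed=0):
    """Monte Carlo estimate of E[Dmax - Dmin] for uniform data in (0, 1)^d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        norms = [
            lk_norm([rng.random() for _ in range(d)], k)
            for _ in range(n_points)
        ]
        total += max(norms) - min(norms)
    return total / trials


def growth_table(dims=(10, 100, 400), ks=(1, 2, 3)):
    """Estimated E[Dmax - Dmin] for each k (rows) and dimensionality (columns)."""
    return {k: [mean_abs_contrast(d, k) for d in dims] for k in ks}
```

Calling `growth_table()` should show the estimate growing with d for k = 1, staying roughly level for k = 2 and shrinking for k = 3, in line with the exponents 1/2, 0 and −1/6 of d.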
Fig. 1. $|Dmax - Dmin|$ depending on d for different metrics (uniform data)
Table 2. Effect of dimensionality on relative ($L_1$ and $L_2$) behavior of relative contrast

Dimensionality    P[U_d < T_d]
1                 Both metrics are the same
2                 85.0%
3                 88.7%
4                 91.3%
10                95.6%
15                96.1%
20                97.1%
100               98.2%


Theorem 2. Let $\mathcal{F}$ be the uniform distribution of N = 2 points. Then,

$$\lim_{d \to \infty} E\left[\left(\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k}\right) \cdot \sqrt{d}\right] = C \cdot \sqrt{\frac{1}{2k+1}}.$$



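Proof. Let $A_d$, $B_d$, $P_1 \ldots P_d$, $Q_1 \ldots Q_d$, $PA_d$, $PB_d$ be defined as in the proof
of Lemma 1. We have shown in the proof of the previous result that
$PA_d / d^{1/k} \to_p \left(\frac{1}{k+1}\right)^{1/k}$. Using Slutsky's theorem we can derive that:

$$\min\left\{\frac{PA_d}{d^{1/k}}, \frac{PB_d}{d^{1/k}}\right\} \to_p \left(\frac{1}{k+1}\right)^{1/k} \qquad (7)$$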



We have also shown in the previous result that:

$$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k-1/2}}\right] = C \cdot \frac{1}{(k+1)^{1/k}} \cdot \sqrt{\frac{1}{2k+1}} \qquad (8)$$



We can combine the results in Equations 7 and 8 to obtain:

$$\lim_{d \to \infty} E\left[\sqrt{d} \cdot \frac{|PA_d - PB_d|}{\min\{PA_d, PB_d\}}\right] = C \cdot \sqrt{\frac{1}{2k+1}} \qquad (9)$$



Note that the above results confirm the results in [6], because they show that
the relative contrast degrades as $1/\sqrt{d}$ for the different distance norms. Note
that for values of d in the reasonable range of data mining applications, the
norm-dependent factor of $\sqrt{1/(2k+1)}$ may play a valuable role in affecting
the relative contrast. For such cases, even the relative rate of degradation of
the different distance metrics for a given data set at the same
dimensionality may be important. In Figure 2 we have illustrated the relative
contrast created by an artificially generated data set drawn from a uniform
distribution in d = 20 dimensions. Clearly, the relative contrast decreases with
increasing value of k and also follows the same trend as $\sqrt{1/(2k+1)}$.

Fig. 2. Relative contrast variation with norm parameter for the uniform distribution


Another interesting aspect which can be explored to improve nearest neighbor
and clustering algorithms in high-dimensional space is the effect of k on the
relative contrast. Even though the expected relative contrast always decreases
with increasing dimensionality, this may not necessarily be true for a given data
set and different k. To show this, we performed the following experiment on the
Manhattan ($L_1$) and Euclidean ($L_2$) distance metrics: Let
$U_d = \frac{Dmax_d^2 - Dmin_d^2}{Dmin_d^2}$ and $T_d = \frac{Dmax_d^1 - Dmin_d^1}{Dmin_d^1}$,
where the superscript denotes the norm parameter. We performed some empirical tests to calculate
the value of $P[U_d < T_d]$ for the case of the Manhattan ($L_1$) and Euclidean
($L_2$) distance metrics for N = 10 points drawn from a uniform distribution. In
each trial, $U_d$ and $T_d$ were calculated from the same set of N = 10 points, and
$P[U_d < T_d]$ was calculated by finding the fraction of times $U_d$ was less than $T_d$
in 1000 trials. The results of the experiment are given in Table 2. It is clear that
with increasing dimensionality d, the value of $P[U_d < T_d]$ continues to increase.
Thus, for higher dimensionality, the relative contrast provided by a norm with
smaller parameter k is more likely to dominate another with a larger parameter.
For dimensionalities of 20 or higher it is clear that the Manhattan distance metric
provides a significantly higher relative contrast than the Euclidean distance
metric with very high probability. Thus, among the distance metrics with integral
norms, the Manhattan distance metric is the method of choice for providing
the best contrast between the different points. This result of our analysis can be
directly used in a number of different applications.
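
The experiment behind Table 2 can be sketched as follows. This is an illustrative reconstruction rather than the original test harness: it assumes that the query point is the origin, that points are uniform on $[0,1]^d$, and that $U_d$ and $T_d$ are the relative contrasts under $L_2$ and $L_1$ respectively. Function names and the seed are hypothetical.

```python
import random


def relative_contrast(points, k):
    """(Dmax - Dmin) / Dmin of the L_k distances of `points` to the origin."""
    dists = [sum(x ** k for x in p) ** (1.0 / k) for p in points]
    return (max(dists) - min(dists)) / min(dists)


def prob_l2_below_l1(d, n_points=10, trials=1000, seed=0):
    """Fraction of trials in which U_d (L_2 contrast) is below T_d (L_1 contrast)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both contrasts are computed from the same set of points in each trial.
        pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]
        u_d = relative_contrast(pts, 2)
        t_d = relative_contrast(pts, 1)
        if u_d < t_d:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for d in (2, 3, 4, 10, 15, 20, 100):
        print(f"d={d:3d}  P[U_d < T_d] ~ {prob_l2_below_l1(d):.3f}")
```

The exact fractions depend on the random seed and on the choice of query point, so they need not reproduce Table 2 digit for digit; the qualitative trend of the fraction rising with d is what the sketch is meant to show.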
3 Fractional Distance Metrics



The result of the previous section that the Manhattan metric ($k = 1$) provides
the best discrimination in high-dimensional data spaces is the motivation for
looking into distance metrics with $k < 1$. We call these metrics fractional distance
metrics. A fractional distance metric $dist_d^f$ ($L_f$ norm) for $f \in (0, 1)$ is defined
as:

$$dist_d^f(x, y) = \left[\sum_{i=1}^{d} |x^i - y^i|^f\right]^{1/f}.$$

To give an intuition of the behavior of the fractional distance metric we plotted
in Figure 3 the unit spheres for different fractional metrics in $\mathbb{R}^2$.
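
As a small illustration of the definition, the sketch below computes the $L_f$ distance as the sum of $|x^i - y^i|^f$ over all coordinates, raised to the power $1/f$. The function name and the argument checks are our own illustrative choices.

```python
def fractional_distance(x, y, f):
    """L_f distance between two equal-length sequences; f may be fractional (0 < f < 1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimensionality")
    if f <= 0:
        raise ValueError("the norm parameter f must be positive")
    return sum(abs(a - b) ** f for a, b in zip(x, y)) ** (1.0 / f)


if __name__ == "__main__":
    origin = (0.0, 0.0)
    # Diagonal points (t, t) with t = 2^(-1/f) lie on the unit sphere of L_f in the plane.
    for f in (0.3, 0.5, 1.0, 2.0):
        t = 2.0 ** (-1.0 / f)
        print(f"f={f}: t={t:.4f}, distance={fractional_distance(origin, (t, t), f):.4f}")
```

The diagonal point of the unit sphere moves towards the origin as f decreases, which corresponds to the increasingly star-shaped unit spheres for small f.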

We will prove most of our results in this section assuming that $f$ is of the form
$1/l$, where $l$ is some integer. The reason that we show the results for this special
case is that we are able to use nice algebraic tricks for the proofs. The natural
conjecture from the smooth continuous variation of $dist_d^f$ with $f$ is that the
results are also true for arbitrary values of $f$. Our results provide considerable
insights into the behavior of the fractional distance metric and its relationship
with the $L_k$-norm for integral values of k.



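Lemma 3. Let $\mathcal{F}$ be the uniform distribution of N = 2 points and $f = 1/l$ for
some integer $l$. Then,

$$\lim_{d \to \infty} E\left[\frac{Dmax_d^f - Dmin_d^f}{d^{1/f-1/2}}\right] = C \cdot \frac{1}{(f+1)^{1/f}} \cdot \sqrt{\frac{1}{2f+1}}.$$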



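Proof. Let $A_d$, $B_d$, $P_1 \ldots P_d$, $Q_1 \ldots Q_d$, $PA_d$, $PB_d$ be defined using the $L_f$ metric
as they were defined in Lemma 1 for the $L_k$ metric. Let further
$QA_d = (PA_d)^f = (PA_d)^{1/l} = \sum_{i=1}^{d} (P_i)^f$ and
$QB_d = (PB_d)^f = (PB_d)^{1/l} = \sum_{i=1}^{d} (Q_i)^f$. Analogous to Lemma 1,
$QA_d / d \to_p \frac{1}{f+1}$ and $QB_d / d \to_p \frac{1}{f+1}$.

We intend to show that $E\left[\frac{|PA_d - PB_d|}{d^{l-1/2}}\right]$ converges to
$C \cdot \frac{1}{(f+1)^{1/f}} \cdot \sqrt{\frac{1}{2f+1}}$ (note that $d^{l-1/2} = d^{1/f-1/2}$). The
difference of distances is

$$|PA_d - PB_d| = \left|\left(\sum_{i=1}^{d} (P_i)^f\right)^{1/f} - \left(\sum_{i=1}^{d} (Q_i)^f\right)^{1/f}\right| = \left|\left(\sum_{i=1}^{d} (P_i)^f\right)^{l} - \left(\sum_{i=1}^{d} (Q_i)^f\right)^{l}\right|.$$

Note that the above expression is of the form
$|a^l - b^l| = |a - b| \cdot \left(\sum_{r=0}^{l-1} a^r \cdot b^{l-r-1}\right)$. Therefore, $|PA_d - PB_d|$ can be written as
$\left\{\left|\sum_{i=1}^{d} (P_i)^f - (Q_i)^f\right|\right\} \cdot \left\{\sum_{r=0}^{l-1} (QA_d)^r \cdot (QB_d)^{l-r-1}\right\}$. By dividing both sides by
$d^{1/f-1/2}$ and regrouping the right hand side we get:

$$\frac{|PA_d - PB_d|}{d^{1/f-1/2}} = \left\{\frac{\left|\sum_{i=1}^{d} (P_i)^f - (Q_i)^f\right|}{\sqrt{d}}\right\} \cdot \left\{\sum_{r=0}^{l-1} \left(\frac{QA_d}{d}\right)^{r} \cdot \left(\frac{QB_d}{d}\right)^{l-r-1}\right\} \qquad (10)$$

By using the results in Equation 10, we can derive that:

$$\frac{|PA_d - PB_d|}{d^{1/f-1/2}} \to_p \left\{\frac{\left|\sum_{i=1}^{d} (P_i)^f - (Q_i)^f\right|}{\sqrt{d}}\right\} \cdot l \cdot \frac{1}{(1+f)^{l-1}} \qquad (11)$$

The random variable $(P_i)^f - (Q_i)^f$ has zero mean and standard deviation
$\sqrt{2} \cdot \sigma$, where $\sigma$ is the standard deviation of $(P_i)^f$. The sum of the different values
of $(P_i)^f - (Q_i)^f$ over d dimensions will converge to a normal distribution with
mean 0 and standard deviation $\sqrt{2} \cdot \sigma \cdot \sqrt{d}$ because of the central limit theorem.
Consequently, the expected mean average deviation of this normal distribution
is $C \cdot \sigma \cdot \sqrt{d}$ for some constant C. For the uniform distribution on $[0, 1]$ we have
$E[(P_i)^f] = \frac{1}{f+1}$ and $E[(P_i)^{2f}] = \frac{1}{2f+1}$, so that
$\sigma = \frac{f}{f+1} \cdot \sqrt{\frac{1}{2f+1}}$. Therefore, we have:

$$\lim_{d \to \infty} E\left[\frac{|(PA_d)^f - (PB_d)^f|}{\sqrt{d}}\right] = C \cdot \sigma = C \cdot \frac{f}{f+1} \cdot \sqrt{\frac{1}{2f+1}} \qquad (12)$$

Combining the results of Equations 12 and 11, and using $l \cdot f = 1$, we get:

$$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/f-1/2}}\right] = \frac{C}{(f+1)^{1/f}} \cdot \sqrt{\frac{1}{2f+1}}.$$

(Footnote to the conjecture stated before Lemma 3: Empirical simulations of the relative contrast show this is indeed the case.)

A direct consequence of the above result is the following generalization to
N = n points.

Corollary 3. When $\mathcal{F}$ is the uniform distribution of N = n points and $f = 1/l$
for some integer $l$, then for some constant C we have:

$$\frac{C}{(f+1)^{1/f}} \sqrt{\frac{1}{2f+1}} \;\leq\; \lim_{d \to \infty} E\left[\frac{Dmax_d^f - Dmin_d^f}{d^{1/f-1/2}}\right] \;\leq\; \frac{C \cdot (n-1)}{(f+1)^{1/f}} \sqrt{\frac{1}{2f+1}}.$$

Proof. Similar to Corollary 1.

The above result shows that the absolute difference between the maximum
and minimum for the fractional distance metric increases at the rate of $d^{1/f-1/2}$.
Thus, the smaller the fraction, the greater the rate of absolute divergence between
the maximum and minimum value. Now, we will examine the relative
contrast of the fractional distance metric.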






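Theorem 3. Let $\mathcal{F}$ be the uniform distribution of N = 2 points and $f = 1/l$
for some integer $l$. Then,

$$\lim_{d \to \infty} E\left[\left(\frac{Dmax_d^f - Dmin_d^f}{Dmin_d^f}\right) \cdot \sqrt{d}\right] = C \cdot \sqrt{\frac{1}{2f+1}}$$

for some constant C.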



Proof. Analogous to the proof of Theorem 2. 



The following is the direct generalization to N = n points. 

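Corollary 4. Let $\mathcal{F}$ be the uniform distribution of N = n points, and $f = 1/l$
for some integer $l$. Then, for some constant C,

$$C \cdot \sqrt{\frac{1}{2f+1}} \;\leq\; \lim_{d \to \infty} E\left[\left(\frac{Dmax_d^f - Dmin_d^f}{Dmin_d^f}\right) \cdot \sqrt{d}\right] \;\leq\; C \cdot (n-1) \cdot \sqrt{\frac{1}{2f+1}}.$$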



Proof. Analogous to the proof of Corollary 1. 
This result is true for the case of arbitrary values of $f$ (not just $f = 1/l$) and
$N$, but the use of these specific values of $f$ helps considerably in simplifying
the proof of the result. The empirical simulation in Figure 2 shows the behavior
for arbitrary values of $f$ and $N$. The curve for each value of $N$ is different, but all
curves fit the general trend of reduced contrast with increased value of $f$. Note
that the value of the relative contrast for both the integral distance
metric $L_k$ and the fractional distance metric $L_f$ is the same in the boundary case
when $f = k = 1$.

The above results show that fractional distance metrics provide better con- 
trast than integral distance metrics both in terms of the absolute distributions 
of points to a given query point and relative distances. This is a surprising result 
in light of the fact that the Euclidean distance metric is traditionally used in 
a large variety of indexing structures and data mining applications. The wide- 
spread use of the Euclidean distance metric stems from the natural extension 
of applicability to spatial database systems (many multidimensional indexing 
structures were initially proposed in the context of spatial systems). However, 
from the perspective of high dimensional data mining applications, this natural 
interpretability in 2 or 3-dimensional spatial systems is completely irrelevant. 
Whether the theoretical behavior of the relative contrast also translates into 
practically useful implications for high dimensional data mining applications is 
an issue which we will examine in greater detail in the next section. 



4 Empirical Results 

In this section, we show that our surprising findings can be directly applied to
improve existing mining techniques for high-dimensional data. For the experiments,
we use synthetic and real data. The synthetic data consists of a number
of clusters (data inside the clusters follow a normal distribution and the cluster
centers are uniformly distributed). The advantage of the synthetic data sets is
that the clusters are clearly separated and any clustering algorithm should be
able to identify them correctly. For our experiments we used one of the most widely
used standard clustering algorithms, the k-means algorithm. The data set
used in the experiments consists of 6 clusters with 10000 data points each and
no noise. The dimensionality was chosen to be 20. The results of our experiments
show that the fractional distance metrics provide a much higher classification
rate, which is about 99% for the fractional distance metric with $f = 0.3$ versus
89% for the Euclidean metric (see Figure 4). The detailed results including the
confusion matrices obtained are provided in the appendix.

Fig. 4. Effectiveness of k-Means

For the experiments with real data sets, we use some of the classification
problems from the UCI machine learning repository (http://www.cs.uci.edu/~mlearn). All of these problems
are classification problems which have a large number of feature variables, and
a special variable which is designated as the class label. We used the following
simple experiment: For each of the cases that we tested on, we stripped off the
class variable from the data set and considered the feature variables only. The
query points were picked from the original database, and the closest $l$ neighbors
were found to each target point using different distance metrics. The technique
was tested using the following two measures:

1. Class Variable Accuracy: This was the primary measure that we used
in order to test the quality of the different distance metrics. Since the class variable
is known to depend in some way on the feature variables, the proximity
of objects belonging to the same class in feature space is evidence of the meaningfulness
of a given distance metric. The specific measure that we used was
the total number of the $l$ nearest neighbors that belonged to the same class as
the target object over all the different target objects. Needless to say, we do not
intend to propose this rudimentary unsupervised technique as an alternative to
classification models, but use the classification performance only as evidence
of the meaningfulness (or lack of meaningfulness) of a given distance metric. The
class labels may not necessarily always correspond to locality in feature space;
therefore the meaningfulness results presented are evidential in nature. However,
a consistent effect on the class variable accuracy with increasing norm parameter
does tend to be a powerful way of demonstrating qualitative trends.

2. Noise Stability: How does the quality of the distance metric vary with
more or less noisy data? We used noise masking in order to evaluate this aspect.
In noise masking, each entry in the database was replaced by a random entry
with masking probability $p_c$. The random entry was chosen from a uniform
distribution centered at the mean of that attribute. Thus, when $p_c$ is 1, the data
is completely noisy. We studied how each of the two problems was affected by
noise masking.
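
The two measures can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported experiments: the function names are hypothetical, the neighbor search is a brute-force scan, and the width of the uniform masking distribution (which the text does not specify) is assumed to be the largest deviation from the attribute mean.

```python
import random


def lp_distance(x, y, p):
    """L_p distance; p may be fractional, integral, or float('inf')."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def class_variable_accuracy(features, labels, p, n_neighbors=1):
    """Total number of the n_neighbors nearest neighbors sharing the target's class."""
    matches = 0
    for i, target in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: lp_distance(target, features[j], p))
        matches += sum(labels[j] == labels[i] for j in others[:n_neighbors])
    return matches


def noise_mask(features, p_c, rng):
    """Replace each entry, with probability p_c, by a uniform draw centered at the attribute mean."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    # ASSUMPTION: half-width of the uniform distribution = largest deviation from the mean.
    half = [max(abs(row[j] - means[j]) for row in features) for j in range(d)]
    return [
        [rng.uniform(means[j] - half[j], means[j] + half[j]) if rng.random() < p_c else row[j]
         for j in range(d)]
        for row in features
    ]


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    classes = [0, 0, 1, 1]
    rng = random.Random(0)
    for p_c in (0.0, 0.5, 1.0):
        noisy = noise_mask(data, p_c, rng)
        print(p_c, class_variable_accuracy(noisy, classes, p=0.5))
```

Dividing the returned count by the count obtained from a random matching scheme gives an accuracy ratio of the kind plotted against the norm parameter and the masking probability.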

In Table 3, we have illustrated some examples of the variation in performance
for different distance metrics. With a few exceptions, the major trend in
this table is that the accuracy performance decreases with increasing value of the
norm parameter. We have shown the table in the range $L_{0.1}$ to $L_{10}$ because it was
easiest to calculate the distance values without exceeding the numerical ranges in
the computer representation. We have also illustrated the accuracy performance
when the $L_\infty$ metric is used. One interesting observation is that the accuracy
with the $L_\infty$ distance metric is often worse than the accuracy value obtained by picking
a record from the database at random and reporting the corresponding target
value. This trend is observed because the $L_\infty$ metric only looks
at the dimension at which the target and neighbor are furthest apart. In high
dimensional space, this is likely to be a very poor representation of the nearest
neighbor. A similar argument is true for $L_k$ distance metrics (for high values of
k), which give undue importance to the distant (sparse/noisy) dimensions.
It is precisely this aspect which is reflected in our theoretical analysis of the
relative contrast, which results in distance metrics with high norm parameters
discriminating poorly between the furthest and nearest neighbor.

Table 3. Number of correct class label matches between nearest neighbor and target

Fig. 5. Accuracy depending on the norm parameter

Fig. 6. Accuracy depending on noise masking

In Figure 5, we have shown the variation in the accuracy of the class variable
matching with k, when the $L_k$ norm is used. The accuracy on the Y-axis is
reported as the ratio of the accuracy to that of a completely random matching
scheme. The graph is averaged over all the data sets of Table 3. It is easy to see
that there is a clear trend of the accuracy worsening with increasing values of
the parameter k.

We also studied the robustness of the scheme to the use of noise masking.
For this purpose, we have illustrated the performance of three distance metrics
in Figure 6: $L_{0.1}$, $L_1$, and $L_{10}$ for various values of the masking probability on
the machine data set. On the X-axis, we have denoted the value of the masking
probability, whereas on the Y-axis we have the accuracy ratio to that of a completely
random matching scheme. Note that when the masking probability is 1,
then any scheme would degrade to a random method. However, it is interesting
to see from Figure 6 that the $L_{10}$ distance metric degrades much faster to the
random performance (at a masking probability of 0.4), whereas the $L_1$ degrades
to random at 0.6. The $L_{0.1}$ distance metric is most robust to the presence of
noise in the data set and degrades to random performance at the slowest rate.
These results are closely connected to our theoretical analysis which shows the
rapid lack of discrimination between the nearest and furthest distances for high
values of the norm parameter because of undue weighting being given to the
noisy dimensions which contribute the most to the distance.



5 Conclusions and Summary 

In this paper, we showed some surprising results on the qualitative behavior of
the different distance metrics for measuring proximity in high dimensionality.
We demonstrated our results in both a theoretical and empirical setting. In the
past, not much attention has been paid to the choice of distance metrics used
in high dimensional applications. The results of this paper are likely to have a
powerful impact on the particular choice of distance metric which is used for
problems such as clustering, categorization, and similarity search, all of which
depend upon some notion of proximity.
Appendix



Here we provide a detailed proof of Lemma 2, which proves our modified convergence
results for arbitrary distributions of points. This Lemma shows that the
asymptotic rate of convergence of the absolute difference of distances between
the nearest and furthest points is dependent on the distance norm used. To recap,
we restate Lemma 2.



Lemma 2: Let $\mathcal{F}$ be an arbitrary distribution of N = 2 points. Then,

$$\lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k-1/2}}\right] = C_k,$$

where $C_k$ is some constant dependent on k.



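Proof. Let $A_d$ and $B_d$ be the two points in a d dimensional data distribution
such that each coordinate is independently drawn from the data distribution $\mathcal{F}$.
Specifically $A_d = (P_1 \ldots P_d)$ and $B_d = (Q_1 \ldots Q_d)$ with $P_i$ and $Q_i$ being drawn
from $\mathcal{F}$. Let $PA_d = \left\{\sum_{i=1}^{d} (P_i)^k\right\}^{1/k}$ be the distance of $A_d$ to the origin using
the $L_k$ metric and $PB_d = \left\{\sum_{i=1}^{d} (Q_i)^k\right\}^{1/k}$ the distance of $B_d$.

We assume that the kth power of a random variable drawn from the distribution
$\mathcal{F}$ has mean $\mu_{\mathcal{F},k}$ and standard deviation $\sigma_{\mathcal{F},k}$. This means that:

$$\frac{PA_d^k}{d} \to_p \mu_{\mathcal{F},k}, \qquad \frac{PB_d^k}{d} \to_p \mu_{\mathcal{F},k}$$

and therefore:

$$\frac{PA_d}{d^{1/k}} \to_p \left(\mu_{\mathcal{F},k}\right)^{1/k}, \qquad \frac{PB_d}{d^{1/k}} \to_p \left(\mu_{\mathcal{F},k}\right)^{1/k}. \qquad (14)$$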

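We intend to show that $E\left[\frac{|PA_d - PB_d|}{d^{1/k-1/2}}\right]$ converges to $C_k$ for some constant $C_k$ depending
on k. We express $|PA_d - PB_d|$ in the following numerator/denominator form
which we will use in order to examine the convergence behavior of the numerator
and denominator individually.

$$|PA_d - PB_d| = \frac{|(PA_d)^k - (PB_d)^k|}{\sum_{r=0}^{k-1} (PA_d)^{k-r-1} (PB_d)^{r}} \qquad (15)$$

Dividing both sides by $d^{1/k-1/2}$ and regrouping on the right-hand side we get

$$\frac{|PA_d - PB_d|}{d^{1/k-1/2}} = \frac{|(PA_d)^k - (PB_d)^k| / \sqrt{d}}{\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^{r}} \qquad (16)$$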

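Consequently, using Slutsky's theorem and the results of Equation 14 we have:

$$\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \cdot \left(\frac{PB_d}{d^{1/k}}\right)^{r} \to_p k \cdot \left(\mu_{\mathcal{F},k}\right)^{(k-1)/k} \qquad (17)$$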



Having characterized the convergence behavior of the denominator of the right-hand
side of Equation 16, let us now examine the behavior of the numerator:

$$\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}} = \frac{\left|\sum_{i=1}^{d} \left((P_i)^k - (Q_i)^k\right)\right|}{\sqrt{d}} = \frac{\left|\sum_{i=1}^{d} R_i\right|}{\sqrt{d}}.$$

Here $R_i$ is the new random variable defined by $(P_i)^k - (Q_i)^k$ for all $i \in \{1, \ldots, d\}$.
This random variable has zero mean and standard deviation $\sqrt{2} \cdot \sigma_{\mathcal{F},k}$,
where $\sigma_{\mathcal{F},k}$ is the standard deviation of $(P_i)^k$. Then, the sum of different values
of $R_i$ over d dimensions will converge to a normal distribution with mean 0
and standard deviation $\sqrt{2} \cdot \sigma_{\mathcal{F},k} \cdot \sqrt{d}$ because of the central limit theorem.
Consequently, the mean average deviation of this distribution will be $C \cdot \sigma_{\mathcal{F},k} \cdot \sqrt{d}$
for some constant C. Therefore, we have:

$$\lim_{d \to \infty} E\left[\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}}\right] = C \cdot \sigma_{\mathcal{F},k} \qquad (18)$$



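Since the denominator of Equation 16 shows probabilistic convergence, we can
combine the results of Equations 17 and 18 to obtain:

$$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k-1/2}}\right] = C \cdot \frac{\sigma_{\mathcal{F},k}}{k \cdot \left(\mu_{\mathcal{F},k}\right)^{(k-1)/k}} \qquad (19)$$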



The result follows. 

Confusion Matrices. We have illustrated the confusion matrices for two different
values of p below. As illustrated, the confusion matrix for using the value
p = 0.3 is significantly better than the one obtained using p = 2.

Table 4. Confusion Matrix, p = 2 (rows for prototype, columns for cluster)

Table 5. Confusion Matrix, p = 0.3 (rows for prototype, columns for cluster)
Abstract. Nearest-neighbor queries in high-dimensional space are of high importance
in various applications, especially in content-based indexing of multimedia
data. For an optimization of the query processing, accurate models for estimating
the query processing costs are needed. In this paper, we propose a new cost model
for nearest neighbor queries in high-dimensional space, which we apply to enhance
the performance of high-dimensional index structures. The model is based
on new insights into effects occurring in high-dimensional space and provides a
closed formula for the processing costs of nearest neighbor queries depending on
the dimensionality, the block size and the database size. From the wide range of
possible applications of our model, we select two interesting samples: First, we
use the model to prove the known linear complexity of the nearest neighbor search
problem in high-dimensional space, and second, we provide a technique for optimizing
the block size. For data of medium dimensionality, the optimized block
size allows significant speed-ups of the query processing time when compared to
traditional block sizes and to the linear scan.

1. Introduction 

Nearest neighbor queries are important for various applications such as content-based indexing
in multimedia systems [10], similarity search in CAD systems [7, 17], docking of molecules
in molecular biology [21], and string matching in text retrieval [1]. Most applications
use some kind of feature vector for an efficient access to the complex original data. Examples
of feature vectors are color histograms [20], shape descriptors [15, 18], Fourier vectors [23],
and text descriptors [14]. According to [3], nearest neighbor search on high-dimensional feature
vectors may be defined as follows:

Given a data set DS of points in a d-dimensional space $[0, 1]^d$, find the data point NN from DS
which is closer to the given query point Q than any other point in DS. More formally:

$$NN(Q) = \{\bar{e} \in DS \mid \forall e \in DS: \|\bar{e} - Q\| \leq \|e - Q\|\}.$$

A problem of index-based nearest neighbor search is that it is difficult to estimate the time 
which is needed for executing the nearest neighbor query. The estimation of the time, how- 
ever, is crucial for optimizing important parameters of the index structures such as the block 
size. An adequate cost model should work for data sets with an arbitrary number of dimen- 
sions and an arbitrary size of the database, and should be applicable to different data distribu- 
tions and index structures. Most important, however, it should provide accurate estimates of 
the expected query execution time in order to allow an optimization of the parameters of the 
index structure. 

In a previous paper [3], we proposed a cost model which is very accurate for estimating the
cost of nearest-neighbor queries in high-dimensional space. This cost model is based on tables
which are generated using the Monte Carlo integration method. Although expensive numerical
integration occurs only in the compile time of the cost estimator, not in the execution
time (when actually determining the cost of query execution), expensive numerical steps are
completely avoided in the cost model proposed in this paper. Based on recent progress in
understanding the effects of indexing high-dimensional spaces [4], we develop a new cost
model for an index-based processing of nearest neighbor queries in high-dimensional space.
The model is completely analytical, allowing a direct application to query optimization problems.
The basic idea of the model is to estimate the number of data pages intersecting the
nearest neighbor sphere by first determining the expected radius of the sphere. Assuming a
certain location of the query point and a partition of the data space into hyperrectangles, the
number of intersected pages can be represented by a staircase function. The staircase function
results from the fact that the data pages can be collected into groups such that all pages in
a group have the same "skewness" to the query point. Each page in the group is intersected
simultaneously and, therefore, the cost model results in a staircase function.

The model has a wide range of interesting theoretical and practical applications. The model
may, for example, be used to confirm the known theoretical result [24] that the time complexity
of nearest neighbor search in a very high dimensional space is linear. Since the model
provides a closed formula for the time complexity (depending on the parameters dimensionality,
database size, and block size), the model can also be used to determine the practically
relevant break-even dimensionality between index-based search and linear scan (for a given
database size). For dimensionalities below the break-even point, the index-based search performs
better; for dimensionalities above the break-even point, the linear scan performs better.
Since the linear scan of the database can also be considered as an index structure with an
infinite block size, the query optimization problem can also be modeled as a continuous
block size optimization problem. Since our model can be evaluated analytically and since the
block size is one of the parameters of the model, the cost model can be applied to determine
the block size for which the minimum estimated costs occur. The result of our theoretical
analysis is surprising and shows that even for medium-dimensional spaces, a large block size
(such as 16 KByte for 16 dimensions) clearly outperforms traditional block sizes (2 KByte or 4
KByte) and the linear scan. An index structure built with the optimal block size always
shows a significantly better performance, resulting in traditional block sizes in lower dimensional
spaces and in a linear scan in very high dimensions. Also, the query processing cost
will never exceed the cost for a linear scan, as index structures typically do in very high dimensions
due to a large number of random seek operations. A practical evaluation and comparison
to real measurements confirms our theoretical results and shows speed-ups of up to
528% over the index-based search and up to 500% over the linear scan. Note that a cost model
such as the one proposed in [3] can hardly be used for a parameter optimization because of
the numerical component of the model.

2. Related Work 

The research in the field of nearest neighbor search in high-dimensional space may be divid- 
ed into three areas: index structures, nearest neighbor algorithms on top of index structures, 
and cost models. 

The first index structure focusing on high-dimensional spaces was the TV-tree proposed by
Lin, Jagadish, and Faloutsos [16]. The basic idea of the TV-tree is to divide attributes into
attributes which are important for the search process and others which can be ignored because
these attributes have a small chance to contribute to query processing. The major drawback
of the TV-tree is that information about the behavior of single attributes, e.g. their selectivity,
is required. Another R-tree-like high-dimensional index structure is the SS-tree [25],
which uses spheres instead of bounding boxes in the directory. Although the SS-tree clearly
outperforms the R*-tree, spheres tend to overlap in high-dimensional spaces and therefore,
the performance also degenerates. In [13], an improvement of the SS-tree has been proposed,
where the concepts of the R-tree and SS-tree are integrated into a new index structure, the
SR-tree. The directory of the SR-tree consists of spheres (SS-tree) and hyperrectangles (R-tree)
such that the area corresponding to a directory entry is the intersection between the
sphere and the hyperrectangle. Another approach has been proposed in [6]. The X-tree is an
index structure adapting the algorithms of R*-trees to high-dimensional data using two techniques:
First, the X-tree introduces an overlap-free split algorithm which is based on the split
history of the tree. Second, if the overlap-free split algorithm would lead to an unbalanced
directory, the X-tree omits the split and the corresponding directory node becomes a so-called
supernode. Supernodes are directory nodes which are enlarged by a multiple of the
block size. Another approach related to nearest neighbor query processing in high-dimensional
space is the parallel method described in [4]. The basic idea of the declustering technique
is to assign the buckets corresponding to different quadrants of the data space to different
disks, thereby allowing an optimal speed-up for the parallel processing of nearest
neighbor queries.

Besides the index structures, the algorithms used to perform the nearest neighbor search are
obviously important. The algorithm of Roussopoulos et al. [19] operates on R-trees. It
traverses the tree in a top-down fashion, always visiting the closest bounding box first. Since in
case of bounding boxes there always exists a maximal distance to the closest point in the box,
the algorithm can prune some of the branches early in the search process. The algorithm,
however, can be shown to be suboptimal since, in general, it visits more nodes than necessary,
i.e., more nodes than intersected by the nearest neighbor sphere. An algorithm which avoids
this problem is the algorithm by Hjaltason and Samet [12]. This algorithm traverses the space
partitions ordered by the so-called MINDIST, which is the distance of the closest point in the
box to the query point. Since the algorithm does not work in a strict top-down fashion, the
algorithm has to keep a list of visited nodes in main memory. The algorithm can be shown to
be time-optimal [3]; however, in high-dimensional spaces, the size of the list, and therefore
the required main memory, may become prohibitively large.
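
The MINDIST-ordered traversal can be sketched with a priority queue. The following is a simplified illustration of the idea, not the algorithm of [12] in full: the `Node` class, its fields and the function names are hypothetical, only nodes (not individual objects) are kept in the queue, and the Euclidean metric is assumed.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # child nodes (directory page)
    points: list = field(default_factory=list)     # data points (data page)


def mindist(q, lo, hi):
    """Euclidean distance from query q to the closest point of the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def best_first_nn(root, q):
    """Visit partitions in order of increasing MINDIST; stop when none can improve the result."""
    counter = itertools.count()                    # tie-breaker so Nodes are never compared
    heap = [(mindist(q, root.lo, root.hi), next(counter), root)]
    best, best_dist, visited = None, math.inf, 0
    while heap:
        dist, _, node = heapq.heappop(heap)
        if dist >= best_dist:
            break                                  # all remaining boxes lie outside the NN sphere
        visited += 1
        for p in node.points:
            dp = math.dist(q, p)
            if dp < best_dist:
                best, best_dist = p, dp
        for child in node.children:
            heapq.heappush(heap, (mindist(q, child.lo, child.hi), next(counter), child))
    return best, best_dist, visited


if __name__ == "__main__":
    leaf1 = Node((0.0, 0.0), (1.0, 1.0), points=[(0.2, 0.2), (0.9, 0.9)])
    leaf2 = Node((2.0, 2.0), (3.0, 3.0), points=[(2.1, 2.1)])
    root = Node((0.0, 0.0), (3.0, 3.0), children=[leaf1, leaf2])
    print(best_first_nn(root, (1.0, 1.0)))
```

The heap plays the role of the list that has to be kept in main memory; its size is what may become prohibitively large in high-dimensional spaces.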

The third related area is cost models for index-based nearest neighbor queries. One of the
early models is the model by Friedman, Bentley, and Finkel [11]. The assumptions of the
model, however, are unrealistic for the high-dimensional case, since N is assumed to converge
to infinity and boundary effects are not considered. The model by Cleary [9] extends
the Friedman, Bentley, and Finkel model by allowing non-rectangular-bounded pages, but
still does not account for boundary effects. Sproull [22] uses the existing models for optimizing
the nearest neighbor search in high dimensions and shows that the number of data points
must be exponential in the number of dimensions for the models to provide accurate estimates.
A cost model for metric spaces has recently been proposed in [8]. The model, however,
has been designed for a specific index structure (the M-tree) and only applies to metric
spaces. In [3], a cost model has been proposed which is very accurate even in high dimensions
as the model takes boundary effects into account. The model is based on the concept of
the Minkowski sum, which is the volume created by enlarging the bounding box by the query
sphere. Using the Minkowski sum, the modeling of the nearest neighbor search is transformed
into an equivalent problem of modeling point queries. Unfortunately, when boundary
effects are considered, the Minkowski sum can only be determined numerically by
Monte Carlo integration. Thus, none of the models can be used to optimize the parameters of
high-dimensional indexing techniques in a query processor.

Figure 1: Two-dimensional Example for the Data Pages Affected by the Nearest Neighbor Search for an Increasing NN-distance



3. A Model for the Performance of Nearest Neighbor Queries in 
High-Dimensional Space 



As a first step of our model, we determine the expected distance of the query point to the
actual nearest neighbor in the database. For simplification, in the first approximation we
assume uniformly distributed data¹ in a normalized data space $[0, 1]^d$ having a volume of 1. The
nearest neighbor distance may then be approximated by the radius of the sphere which on
the average contains one data point. Thus,

$$\mathrm{Vol}_{\mathrm{Sp}}(\text{NN-dist}) = \frac{1}{N} \quad\Leftrightarrow\quad \frac{\sqrt{\pi^d}}{\Gamma(d/2+1)} \cdot \text{NN-dist}^{\,d} = \frac{1}{N}$$

where $\Gamma(n)$ is the gamma function ($\Gamma(x+1) = x \cdot \Gamma(x)$, $\Gamma(1) = 1$ and
$\Gamma(1/2) = \sqrt{\pi}$), which may be approximated by $\Gamma(n) \approx (n/e)^n \cdot \sqrt{2 \cdot \pi \cdot n}$. From the
above equation, the expected nearest neighbor distance may be determined as

$$\text{NN-dist}(N, d) = \frac{1}{\sqrt{\pi}} \cdot \sqrt[d]{\frac{\Gamma(d/2+1)}{N}}$$
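
The expected nearest neighbor distance can be evaluated numerically. The following is a minimal sketch, assuming uniformly distributed points in the unit hypercube as in the text; the function name `nn_dist` is chosen for illustration. It uses the exact gamma function through `math.lgamma` rather than the closed-form approximation, so its values can differ somewhat from formulas that build on that approximation.

```python
import math

def nn_dist(n_points, dim):
    """Radius of the dim-dimensional sphere of volume 1 / n_points.

    Solves sqrt(pi^d) / Gamma(d/2 + 1) * r^d = 1 / N for r, working with
    logarithms so that large dimensions do not overflow.
    """
    log_ratio = math.lgamma(dim / 2 + 1) - math.log(n_points)
    return math.exp(log_ratio / dim) / math.sqrt(math.pi)
```

For a fixed number of points, the value grows with the dimension and eventually exceeds 0.5, half the edge length of the data space.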

In general, the number of data points does not grow exponentially with the dimension, which
means that not all dimensions are used as split axes. Without loss of generality, we assume
that the first $d'$ dimensions have been used as split dimensions. Thus, $d'$ may be determined as

$$d' = \left\lceil \log_2 \frac{N}{C_{\mathit{eff}}} \right\rceil$$

For simplification, we further assume that each of the $d'$ split dimensions has been split in
the middle.

To determine the number of pages which are intersected by the nearest neighbor sphere, we
now have to determine the number of pages depending on $\text{NN-dist}(N, d)$. In our first
approximation, we only consider the simple case that the query point is located in a corner of the data
space². In figure 1, we show a two-dimensional example for the data pages which are affected
by the nearest neighbor search for an increasing $\text{NN-dist}(N, d)$. Since the data space is
assumed to be normalized to $[0, 1]^d$ and since the data pages are split at most once, we have to
consider more than one data page if $\text{NN-dist}(N, d) > 0.5$. In this case, two additional data
pages have to be accessed in our two-dimensional example (cf. figure 1). If $\text{NN-dist}(N, d)$
increases to more than $0.5 \cdot \sqrt{2}$, we have to consider even more data pages, namely all four
pages in our example. In the general case, we have to consider more data pages each time
$\text{NN-dist}(N, d)$ exceeds the value $0.5 \cdot \sqrt{i}$. We therefore obtain

$$\begin{aligned}
& \text{NN-dist}(N, d) \ge 0.5 \cdot \sqrt{i} \\
\Leftrightarrow\quad & i \le \left( \frac{\text{NN-dist}(N, d)}{0.5} \right)^2 \\
\Leftrightarrow\quad & i \le \left( \frac{1}{0.5} \cdot \frac{1}{\sqrt{\pi}} \cdot \sqrt[d]{\frac{\Gamma(d/2+1)}{N}} \right)^2 \\
\Leftrightarrow\quad & i \le \frac{2 \cdot (d+2)}{e \cdot \pi} \cdot \sqrt[d]{\frac{\pi \cdot (d+2)^3}{4 \cdot e^2 \cdot N^2}}
\end{aligned} \qquad (\text{for } i = 1 \ldots d')$$

where the last step uses the approximation of the gamma function given above.

1. In section 5, we provide an extension of the model for non-uniform data distributions.

2. In high-dimensional space, this assumption is not unrealistic since most of the data points
are close to the surface of the data space (for details see [5]).


The number of data pages which have to be considered in each step are the data pages that
differ in $i$ of the $d'$ split dimensions, which may be determined as $\binom{d'}{i}$.
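
The counting argument can be checked by brute force for small numbers of split dimensions. The sketch below assumes, as in the text, a query point in the corner at the origin and pages obtained by splitting each of the split dimensions once in the middle; the function names are chosen for illustration.

```python
import itertools
import math

def pages_hit_brute_force(split_dims, radius):
    """Enumerate all 2^split_dims pages and count those within `radius` of the origin."""
    count = 0
    for cell in itertools.product((0, 1), repeat=split_dims):
        # A page on the far side (1) of i split dimensions has MINDIST 0.5 * sqrt(i).
        page_mindist = 0.5 * math.sqrt(sum(cell))
        if page_mindist <= radius:
            count += 1
    return count

def pages_hit_formula(split_dims, radius):
    """Sum the binomial coefficients for all i with 0.5 * sqrt(i) <= radius."""
    i_max = min(split_dims, math.floor((radius / 0.5) ** 2))
    return sum(math.comb(split_dims, k) for k in range(i_max + 1))
```

In the two-dimensional example of figure 1, a radius between 0.5 and 0.5·√2 touches three pages (the page of the query point plus two additional ones), and a larger radius touches all four.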



Symbol              Meaning
d                   dimension
N                   number of data points
|db|                size of the database
#b                  number of data pages in the database
|b|                 page size
u                   storage utilization
d'                  number of split dimensions
C_eff               average number of data points per index page
T_IO                I/O time (disk access time independent from the size of the accessed data block)
T_Tr                transfer and processing time (linearly depends on the size of the accessed data block)
NN-dist(|db|, d)    nearest neighbor distance depending on |db| and d
P(N, d)             number of data pages depending on N and d
P(|db|, d)          number of data pages depending on |db| and d
X(#b)               percentage by which the linear scan is faster

Table 1: Important Symbols

Integrating the formulas provides the number of data pages which have to be accessed in
performing a nearest neighbor query on a database with $N$ data points in a $d$-dimensional space:

$$P(N, d) = \sum_{k=0}^{\left\lfloor \frac{2 \cdot (d+2)}{e \cdot \pi} \cdot \sqrt[d]{\frac{\pi \cdot (d+2)^3}{4 \cdot e^2 \cdot N^2}} \right\rfloor} \binom{\left\lceil \log_2 (N / C_{\mathit{eff}}) \right\rceil}{k}$$
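
The page-count formula can be evaluated directly. The following is a minimal sketch under the model's assumptions (uniform data, query point in a corner, each split dimension split once); the function name `pages_accessed` and the parameter name `c_eff` are chosen for illustration, and capping the summation index at the number of split dimensions is an implementation choice that does not change the sum.

```python
import math

def pages_accessed(n_points, dim, c_eff):
    """Evaluate P(N, d) for n_points data points and c_eff points per page."""
    split_dims = max(0, math.ceil(math.log2(n_points / c_eff)))
    # Upper summation limit derived from NN-dist(N, d) >= 0.5 * sqrt(i).
    bound = (2 * (dim + 2) / (math.e * math.pi)) * (
        math.pi * (dim + 2) ** 3 / (4 * math.e ** 2 * n_points ** 2)
    ) ** (1 / dim)
    # Binomial coefficients vanish beyond split_dims, so the index can be capped.
    i_max = min(split_dims, math.floor(bound))
    return sum(math.comb(split_dims, k) for k in range(i_max + 1))
```

For a fixed N, the result is 1 in low dimensions and reaches 2^d', the total number of pages under the model, once the bound exceeds d'.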

In the following, we determine the development of $P(N, d)$ for a constant size database
($|db| = \text{const}$). In this case, the number of data pages is also constant

$$\#b = \frac{|db|}{|b| \cdot u}$$

and the number of data points depends linearly on the database size $|db|$ and inversely on the
dimensionality $d$

$$N = \frac{|db|}{d}$$

The nearest neighbor distance NN-dist now depends on the dimensionality ($d$) and the size of
the database ($|db|$)

$$\text{NN-dist}(|db|, d) = \frac{1}{\sqrt{\pi}} \cdot \sqrt[d]{\frac{\Gamma(d/2+1) \cdot d}{|db|}}$$

and since $C_{\mathit{eff}} = |b| \cdot u / d$, the number of split dimensions $d'$ becomes

$$d' = \left\lceil \log_2 \frac{|db|}{|b| \cdot u} \right\rceil$$
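The number of data pages which have to be accessed in performing a nearest neighbor query
in $d$-dimensional space can now be determined as

$$P(|db|, d) = \sum_{k=0}^{\left\lfloor \frac{2 \cdot (d+2)}{e \cdot \pi} \cdot \sqrt[d]{\frac{\pi \cdot d^2 \cdot (d+2)^3}{4 \cdot e^2 \cdot |db|^2}} \right\rfloor} \binom{\left\lceil \log_2 \frac{|db|}{|b| \cdot u} \right\rceil}{k}$$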

In figure 2, we show the development of $P(|db|, d)$ depending on the dimensionality $d$.
$P(|db|, d)$ is a staircase function which increases to the maximum number of data pages
($\#b$). The staircase property results from the discontinuous increase of $P(|db|, d)$ each time
$\text{NN-dist}(N, d) > 0.5 \cdot \sqrt{i}$ (for $i = 1 \ldots d'$), which is a consequence of using the corners
of the data space as query points. Since for practical purposes this assumption is not sufficient,
in section 5 we extend the model to consider other query points as well.



4. Evaluation 

After introducing our analytical cost model, we are now able to compare the time needed to
perform an index-based nearest neighbor search with the sequential scan of the database. Let
us first consider the factor by which a sequential read is faster than a random block-wise
access. Given a fixed size database ($|db|$) consisting of $\#b = |db| / (|b| \cdot u)$ blocks, the time
needed to read the database sequentially is $T_{IO} + \#b \cdot T_{Tr}$. Reading the same database in a
block-wise fashion requires $\#b \cdot (T_{IO} + T_{Tr})$. From that we get the factor $X$, by which the
sequential read is faster, as

$$X(\#b) \cdot (T_{IO} + \#b \cdot T_{Tr}) = \#b \cdot (T_{IO} + T_{Tr}) \quad\Rightarrow\quad X(\#b) = \frac{\#b \cdot (T_{IO} + T_{Tr})}{T_{IO} + \#b \cdot T_{Tr}}$$

Figure 2: Number of Page Accesses as Determined by our Cost Model

In figure 3a, we show the development of $X(\#b)$ depending on the size of the database for
realistic system parameters which we measured in our experiments ($T_{IO} = 10$ ms,
$T_{Tr} = 1.5$ ms). Note that for very large databases, $X(\#b)$ converges against a system constant

$$\lim_{\#b \to \infty} X(\#b) = \frac{T_{IO} + T_{Tr}}{T_{Tr}}$$

It is also interesting to consider the inverse of $X(\#b)$, which is the percentage of blocks that
could be read randomly instead of sequentially reading the whole database. The function
$1 / X(\#b)$ is plotted in figure 3b for a constant database size.
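
The behaviour of the factor X(#b) can be reproduced with a few lines of code. The sketch below uses the parameter values measured in the text (10 ms and 1.5 ms) as defaults; the function name `speedup_factor` is chosen for illustration.

```python
def speedup_factor(num_blocks, t_io=10.0, t_tr=1.5):
    """X(#b): how much faster a sequential read is than block-wise random access."""
    random_time = num_blocks * (t_io + t_tr)       # one seek per block
    sequential_time = t_io + num_blocks * t_tr     # a single seek, then transfer only
    return random_time / sequential_time
```

For a single block both access patterns cost the same, and for very large databases the factor approaches (10 + 1.5) / 1.5, that is, about 7.67 with these parameters.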

Let us now compare the time needed to perform an index-based nearest neighbor search with
the time needed to sequentially search the database. The time needed to read the database
sequentially is $T_{IO} + \#b \cdot T_{Tr}$, and the time needed to randomly access the necessary
pages in an index-based search is

$$T_{\mathit{IndexSearch}}(|db|, d) = (T_{IO} + T_{Tr}) \cdot \sum_{k=0}^{\left\lfloor \frac{2 \cdot (d+2)}{e \cdot \pi} \cdot \sqrt[d]{\frac{\pi \cdot d^2 \cdot (d+2)^3}{4 \cdot e^2 \cdot |db|^2}} \right\rfloor} \binom{\left\lceil \log_2 \frac{|db|}{|b| \cdot u} \right\rceil}{k}$$

Figure 3: Development of X(#b) and 1/X(#b)

Figure 4: Comparison of Index-based Search and the Sequential Scan Depending
on the Dimensionality d for a Fixed-Size Database



In figure 4, the time resulting from the two formulas is shown for a fixed database size of
256,000 blocks. Note that in the example for $d = 44$, the linear scan becomes faster than the
index-based search. In figure 5, we show the time development depending on both the
dimensionality and the database size. It is clear that for all realistic parameter settings, there
exists a dimensionality $d$ for which the linear scan becomes faster than the index-based
search.
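
The comparison of the two time formulas can be sketched as follows. The timing constants are the values given in the text; the block capacity (|b|·u, counted in coordinate values), the cap on the number of pages, and all function names are assumptions made for illustration, so the dimension found need not coincide with the value read from figure 4.

```python
import math

T_IO = 10.0   # ms per random access, as given in the text
T_TR = 1.5    # ms transfer and processing time per block, as given in the text

def pages_accessed(db_size, dim, block_capacity):
    """P(|db|, d); db_size and block_capacity (= |b| * u) count coordinate values."""
    num_blocks = db_size / block_capacity
    split_dims = max(0, math.ceil(math.log2(num_blocks)))
    bound = (2 * (dim + 2) / (math.e * math.pi)) * (
        math.pi * dim ** 2 * (dim + 2) ** 3 / (4 * math.e ** 2 * db_size ** 2)
    ) ** (1 / dim)
    i_max = min(split_dims, math.floor(bound))
    pages = sum(math.comb(split_dims, k) for k in range(i_max + 1))
    # Assumption: never report more pages than the database holds.
    return min(pages, math.ceil(num_blocks))

def index_time(db_size, dim, block_capacity):
    return (T_IO + T_TR) * pages_accessed(db_size, dim, block_capacity)

def scan_time(db_size, block_capacity):
    return T_IO + (db_size / block_capacity) * T_TR

def crossover_dimension(db_size, block_capacity, max_dim=500):
    """Smallest d for which the index-based search is slower than the scan, or None."""
    limit = scan_time(db_size, block_capacity)
    for dim in range(1, max_dim + 1):
        if index_time(db_size, dim, block_capacity) > limit:
            return dim
    return None
```

Calling `crossover_dimension(256_000 * 1000, 1000)` models a database of 256,000 blocks under the hypothetical capacity of 1000 coordinate values per block.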

This fact is summarized in the following lemma: 

Lemma: (Complexity of Index-based Nearest Neighbor Search)

For realistic system parameters (i.e., $T_{IO} > 0$, $|db| > |b|$, and $u > 0$), there always exists a
dimension $\tilde{d}$ for which the sequential scan is faster than an index-based search. More
formally,

$$\forall\, |db| \;\; \exists\, \tilde{d}: \quad T_{\mathit{IndexSearch}}(|db|, \tilde{d}) > T_{\mathit{LinScan}}(|db|)$$

Idea of the Proof:

From our previous observations, it is clear that $T_{\mathit{IndexSearch}}(|db|, d)$ increases faster than
$T_{\mathit{LinScan}}(|db|)$ and finally all data blocks of the database are read. This is the case for

$$\frac{2 \cdot (d+2)}{e \cdot \pi} \cdot \sqrt[d]{\frac{\pi \cdot d^2 \cdot (d+2)^3}{4 \cdot e^2 \cdot |db|^2}} \;\ge\; \left\lceil \log_2 \frac{|db|}{|b| \cdot u} \right\rceil$$

in which case the index-based search randomly accesses all blocks of the database. A
sequential read of the whole database is faster than the block-wise access by a factor of

$$X(\#b) = \frac{|db| \cdot (T_{IO} + T_{Tr})}{|b| \cdot u \cdot T_{IO} + |db| \cdot T_{Tr}}$$

which exceeds 1 whenever $|db| > |b| \cdot u$, and therefore, the correctness of the lemma is shown, q.e.d.

Figure 5: Comparison of Index-based Search and the Sequential Scan Depending on
the Database Size |db| and the Dimensionality d



Note that this result confirms the pessimistic result by Weber et al. [24] that nearest neighbor 
search in high-dimensional space has a linear time complexity. In figure 6, we compare the 
performance estimations of our numerical model [3] and the new analytical model (proposed 
in this paper) to the performance of the R*-tree [2] on a uniformly distributed data set with a 
fixed number of data pages (256) and varying dimensionality (d = 2 ... 50). As one can see, 
the analytical cost model provides an accurate prediction of the R*-tree's performance over a
wide range of dimensions. Even for low and medium dimensions, the prediction of our model
is close to the measured performance.







Figure 6: Comparison of Cost Models and Measured R*-Tree Performance Depending
on the Dimension



5. Optimized Processing of Nearest Neighbor Queries in
High-Dimensional Space

The objective of this section is to show how the results obtained above can be used to
optimize query processing. We have shown in section 4 that data spaces exist
where it is more efficient to avoid an index-based search. Instead, the sequential scan yields a
better performance. In contrast, for low-dimensional data spaces, multidimensional index 
structures yield a complexity which is logarithmic in the number of data objects. In this sec- 
tion, we show that it is optimal for data spaces with moderate dimensionality to use a logical 
block size which is a multiple of the physical block size provided by the operating system. 
Therefore, we have to consider three different cases of query processing: (1) Index-based 
query processing with traditional block size (4 KByte) in low-dimensional data spaces, (2) 
Sequential Scan processing in high-dimensional data spaces and (3) Index-based query pro- 
cessing with enlarged block-size in medium-dimensional data spaces. As the sequential scan 
can be considered as an index with infinite block size, the third case subsumes case (2). Case 
(1) is trivially subsumed by case (3). Therefore, the problem is reduced to a single optimiza- 
tion task of minimizing the access costs by varying the logical block size. 

In order to obtain an accurate estimation of the minimum-cost blocksize, especially in the
presence of non-uniform data distributions, we have to slightly extend our model. Hence, we
express the expected location of the query point more accurately than assuming the query
point to be in a corner of the data space. If we assume that the location of the query point is
uniformly distributed, we are able to determine the expected distance $E_{\mathit{dist}}$ of the query point
to the closest corner of the data space. However, the formula for $E_{\mathit{dist}}$ is rather complex and
therefore not applicable for practical purposes. Instead, we use the following empirically
derived approximation of the formula which is very accurate up to dimension 100:

$$E_{\mathit{dist}} = \frac{d^{\,0.53}}{4}$$

Figure 7 compares the exact and the approximated distances demonstrating the good accura- 
cy of the approximation. 
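
The quality of the approximation can be examined with a small simulation. The sketch below estimates the expected distance to the closest corner by Monte Carlo sampling of uniformly distributed query points and places it next to the approximation; the function names, the sample count, and the seed are assumptions made for illustration, and the simulation is a stand-in for the exact formula, which is not given in the text.

```python
import math
import random

def edist_approx(dim):
    """Empirical approximation d^0.53 / 4 from the text."""
    return dim ** 0.53 / 4

def edist_monte_carlo(dim, samples=20000, seed=0):
    """Average distance of uniform random points in [0,1]^dim to their closest corner."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        squared = 0.0
        for _ in range(dim):
            x = rng.random()
            # In each coordinate, the closest corner lies at distance min(x, 1 - x).
            nearest = min(x, 1.0 - x)
            squared += nearest * nearest
        total += math.sqrt(squared)
    return total / samples
```

Comparing `edist_approx(d)` with `edist_monte_carlo(d)` for a few dimensions gives a rough impression of the accuracy shown in figure 7.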




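Figure 7: Approximated distance of the query point to the closest corner of data space

If we assume that the query point is located on the diagonal of the data space, we can adapt
our model using the approximate value of $E_{\mathit{dist}}$. In this case, we have to consider more data
pages each time $\text{NN-dist}(N, d)$ exceeds the value $\left(0.5 - \frac{d^{\,0.53}}{4 \sqrt{d}}\right) \cdot \sqrt{i}$ rather than exceeding
$0.5 \cdot \sqrt{i}$ in the original model. Thus, our extended cost function turns out as:

$$T_{\mathit{IndexSearch}}(|db|, d, |b|) = (T_{IO} + |b| \cdot T_{Tr}) \cdot \sum_{k=0}^{\left\lfloor \frac{d+2}{2 \cdot e \cdot \pi \cdot \left(0.5 - \frac{d^{\,0.53}}{4 \sqrt{d}}\right)^2} \cdot \sqrt[d]{\frac{\pi \cdot d^2 \cdot (d+2)^3}{4 \cdot e^2 \cdot |db|^2}} \right\rfloor} \binom{\log_2 \frac{|db|}{|b| \cdot u}}{k}$$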



Figure 8 depicts the graph of the extended cost model with varying dimension and varying
block size. As the model is discrete over the dimensions, there are some staircases. On the
other hand, the model is continuous over the blocksize and therefore amenable to an
optimization by differentiation. In Figure 8, all three cases can be seen: For low dimensions
(staircase in the front), the cost function is monotonically increasing with increasing block size.
Therefore, the lowest possible blocksize should be taken. In the high-dimensional case
(staircase in the back), the cost function is monotonically decreasing with increasing block size.
Therefore, the optimal blocksize is infinite, or in other words, we are supposed to use
sequential scan instead of indexing.

Figure 8: Graph of the Extended Cost Function: The staircase in the front is associated
with dimensions for which a minimum block size yields least cost. The staircase
in the back shows dimensions for which sequential scan causes less
costs than index search whereas the staircase in the middle shows minimum
cost for an optimal block size in a medium size range.

The most interesting phenomenon is the staircase in the middle. Here the cost function is
monotonically falling to a minimum and then monotonically increasing. The cost function
obviously has a single minimum and no further local extrema, which facilitates the search for the
optimum. In order to find the block size for which the minimum cost occurs, we simply
differentiate the cost function with respect to the block size. The derivative of the cost function
$T_{\mathit{IndexSearch}}(|db|, d, |b|)$ can be determined as follows:



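$$\begin{aligned}
\frac{\partial T_{\mathit{IndexSearch}}(|db|, d, |b|)}{\partial |b|}
&= \frac{\partial}{\partial |b|} \left[ (T_{IO} + |b| \cdot T_{Tr}) \cdot \sum_{k=0}^{\left\lfloor \frac{d+2}{2 \cdot e \cdot \pi \cdot \left(0.5 - \frac{d^{\,0.53}}{4 \sqrt{d}}\right)^2} \cdot \sqrt[d]{\frac{\pi \cdot d^2 \cdot (d+2)^3}{4 \cdot e^2 \cdot |db|^2}} \right\rfloor} \binom{\log_2 \frac{|db|}{|b| \cdot u}}{k} \right] \\
&= \sum_{k=0}^{\left\lfloor \frac{d+2}{2 \cdot e \cdot \pi \cdot \left(0.5 - \frac{d^{\,0.53}}{4 \sqrt{d}}\right)^2} \cdot \sqrt[d]{\frac{\pi \cdot d^2 \cdot (d+2)^3}{4 \cdot e^2 \cdot |db|^2}} \right\rfloor} \binom{\log_2 \frac{|db|}{|b| \cdot u}}{k} \cdot \bigl( T_{Tr} + (T_{IO} + |b| \cdot T_{Tr}) \cdot \chi \bigr)
\end{aligned}$$

with

$$\chi = \frac{\Psi\!\left( \log_2 \frac{|db|}{|b| \cdot u} - k + 1 \right) - \Psi\!\left( \log_2 \frac{|db|}{|b| \cdot u} + 1 \right)}{|b| \cdot \ln(2)}$$

where $\Psi$ is the well-known digamma function, the derivative of the natural logarithm of the
$\Gamma$-function:

$$\Psi(x) = \frac{\partial}{\partial x} \ln(\Gamma(x))$$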

The derivative of $T_{\mathit{IndexSearch}}$ is continuous over $|b|$. Three cases can be distinguished: (1)
The derivative is positive over all block sizes. In this case, the minimum block size is
optimal. (2) The derivative is negative. Then, an infinite blocksize is optimal and the search is
processed by sequentially scanning the only data page. (3) The derivative of the cost function
has a zero value. In this case, the equation

$$\frac{\partial T_{\mathit{IndexSearch}}(|db|, d, |b|)}{\partial |b|} = 0$$

has to be solved, determining the optimal blocksize $|b| = |b|_{\mathit{opt}}$. Finding the optimal blocksize
can be done by a simple binary search. As the cost function is smooth, this will lead to an
optimal blocksize after a small (logarithmic) number of steps.

Figure 9: Optimal Block Size for Varying Dimensions
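
The three cases can be illustrated with a simple search over candidate block sizes. Note that the sketch below does not solve the derivative equation; it swaps in a plain scan over logical block sizes that are power-of-two multiples of a minimum block size and picks the cheapest one. It also rounds the number of split dimensions up to an integer, as in section 3, so that integer binomial coefficients can be used. Block sizes and the database size are counted in coordinate values, `t_tr` is the transfer time per coordinate value, and all names and parameter values are assumptions made for illustration.

```python
import math

def index_search_cost(db_size, dim, block_size, t_io, t_tr, utilization=1.0):
    """Extended cost function for a logical block size (a sketch, see assumptions above)."""
    num_blocks = max(1.0, db_size / (block_size * utilization))
    split_dims = math.ceil(math.log2(num_blocks))
    # Distance of a query point on the diagonal to the split planes.
    margin = 0.5 - dim ** 0.53 / (4 * math.sqrt(dim))
    bound = (dim + 2) / (2 * math.e * math.pi * margin ** 2) * (
        math.pi * dim ** 2 * (dim + 2) ** 3 / (4 * math.e ** 2 * db_size ** 2)
    ) ** (1 / dim)
    i_max = min(split_dims, math.floor(bound))
    pages = sum(math.comb(split_dims, k) for k in range(i_max + 1))
    return (t_io + block_size * t_tr) * pages

def best_block_size(db_size, dim, min_block, t_io, t_tr, utilization=1.0):
    """Cheapest candidate block size; the last candidate models the sequential scan."""
    candidates = []
    size = min_block
    while size * utilization < db_size:
        candidates.append(size)
        size *= 2
    candidates.append(math.ceil(db_size / utilization))  # whole database in one block
    return min(
        candidates,
        key=lambda b: index_search_cost(db_size, dim, b, t_io, t_tr, utilization),
    )
```

With these hypothetical parameters, a low dimension selects the minimum block size, a high dimension selects the single block that holds the whole database, and intermediate dimensions may select a block size in between.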

The development of the optimal block size over varying dimensions and varying database
sizes is depicted in Figure 9. The position of the optimum is, as expected, heavily affected by
the dimension of the data space. At dimensions below 9, it is optimal to use the minimal
block size of the system, in this case 1 KByte. In contrast, when searching in a 17-dimensional
data space (or higher), the optimal block size is infinite. Therefore, the sequential scan
yields the best performance in this case. The position of the optimum also depends on the
database size. Figure 10 shows the optimal block size for varying database sizes. Although
the variation is not very strong, the optimal block size decreases with increasing
database size. We therefore suggest a dynamic adaptation of the block size if the database
size is unknown a priori.

Figure 10: Optimal Block Size for Varying Sizes of the Database

Finally, we are going to evaluate the accuracy of our cost model. For this purpose, we created
indexes on real data containing 4, 8, and 16-dimensional Fourier vectors (derived from CAD
data, normalized to the unit hypercube). The size of the databases was 2.5 and 1.0 MBytes.
Figure 11 shows the results of these experiments (total elapsed time in seconds), revealing
the three different cases of query processing mentioned in the beginning of this section. The
left side shows the results of the 4-dimensional data space, where normal index-based query
processing with small block sizes is optimal. To the right, we have the high-dimensional
case, where the infinite block size (sequential scan) is optimal. The interesting case
(8-dimensional data space) is presented in the middle. Here, the performance forms an optimum
at a block size of 32 KBytes, which is very unusual in database applications. The query
processing software using this logical block size (16 contiguous pages of the operating system)
yields substantial performance improvements over the normal index (258%) as well as over
the sequential scan (205%). The maximal speedup reached in this series of experiments was
528% over normal index processing and 500% over the infinite block size (sequential scan).



6. Conclusions 

In this paper, we propose a new analytical cost model for nearest neighbor queries in
high-dimensional spaces. The model is based on recent insights into the effects occurring in
high-dimensional spaces and can be applied to optimize the processing of nearest neighbor
queries. One important application is the determination of an optimal block size depending
on the dimensionality of the data and the database size. Since the linear scan can be
seen as a special configuration of an index structure with an infinite block size, an index
structure using the optimal block size will perform as well as a linear scan of the database
in very high dimensions. For a medium dimensionality, the optimal block size leads to
significant performance improvements over the index-based search and the linear scan. An
experimental evaluation shows that an index structure using the optimal block size
outperforms an index structure using the normal block size of 4 KBytes by up to 528%.